# Data Analysis Notebook

In this notebook, we merge our two datasets (the Now Corpus data and the FactBook data) and analyze the correlation between the FactBook facts and the topics that the LDA models generated from the news articles.

This part mostly covers the plots and analysis results produced after the topic selection process.

In [1]:
#imports
import pandas as pd
import numpy as np
from collections import Counter
import re
import csv
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

import warnings
warnings.filterwarnings('ignore')
import os

In [2]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
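# The API key is read from an environment variable instead of being hardcoded;
# the variable name PLOTLY_API_KEY is illustrative.
plotly.tools.set_credentials_file(username='ezgiY', api_key=os.environ['PLOTLY_API_KEY'])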

In [3]:
# Read the data
topic_distribution = pd.read_csv('../Data/topic_distribution.csv')
full_data_sources = pd.read_csv('../Data/final-news-data.csv')
all_facts_org = pd.read_csv('../Data/AllFacts.csv')

The datasets contain the following:
- Topic Distribution data shows the topic distribution per country.
    NaN values exist in this data because topics were assigned to each country from separate models in order to prevent bias in the results. Some countries have either far more articles or extreme values, which cause topics to focus on specific events and neglect other countries; for example, the US biases the Politics topic towards words related to the US election and away from frequent political words from other countries. After creating one model per country, we assigned the topics found; where no topic could be identified from a country's news, we assigned NaN.
- Final News Data is the source data with an extra column holding the topic assigned to each article (the topic with the maximum probability for that article).
- AllFacts data is the cleaned and preprocessed version of the FactBook data.
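
The maximum-probability rule from the Final News Data bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual preprocessing: the `article_topic_probs` frame, its index values and its probabilities are hypothetical, and only the topic names come from the data.

```python
import pandas as pd

# Hypothetical per-article topic probabilities (rows: articles, columns: topics).
# In the real pipeline these would come from the per-country LDA models.
article_topic_probs = pd.DataFrame(
    {
        "POLITICS": [0.70, 0.10, 0.25],
        "SPORTS": [0.20, 0.15, 0.60],
        "ECONOMY": [0.10, 0.75, 0.15],
    },
    index=[101, 102, 103],  # hypothetical textID values
)

# idxmax(axis=1) returns, for each row, the column label holding the largest value,
# i.e. the most probable topic of each article.
assigned_topics = article_topic_probs.idxmax(axis=1)
print(assigned_topics)
```

Ties, if any, are resolved in favor of the first column, which may or may not match how the original assignment handled them.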

In [4]:
# Topic Distribution data structure
topic_distribution.drop(columns=['EDUCATION/FAMILY'],inplace=True)
topic_distribution.head()

Unnamed: 0,country,country_name,ENVIRONMENT/ENERGY,INTERNATIONAL,POLITICS,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,SOCIAL_LIFE/DAILY,ENTERTAINMENT/ART/MAGAZINE,COMPANY/BUSINESS,ECONOMY,POLICE/ACCIDENT/VIOLENCE,LEGAL/LAW,HEALTH/MEDICAL
0,US,United States,0.022056,,0.194978,0.113442,0.082864,0.13251,0.055243,0.084825,0.084809,0.084572,0.056618,0.088082
1,CA,Canada,0.194973,,,0.09584,,0.1466,0.073039,0.073039,0.138273,0.165221,,0.113016
2,GB,United Kingdom,,0.175466,,0.22891,,0.057249,0.147708,0.148592,0.037203,,0.121004,0.083868
3,IE,Ireland,,,0.177962,0.252041,,0.054292,0.184377,,0.143317,,0.13386,0.054152
4,AU,Australia,,0.029435,,0.124271,0.163525,0.112324,0.153065,,0.157716,,0.117698,0.141965


In [5]:
# Final News Data structure
full_data_sources.head()

Unnamed: 0,textID,#words,date,country,website,url,title,Topics
0,3732490,815,2015-11-01,US,NPR,http://www.npr.org/2015/11/01/450889721/the-ma...,The Madonna Of 115th Street Gets A Long-Awaite...,SOCIAL_LIFE/DAILY
1,3732492,258,2015-11-01,US,Huffington Post,http://www.huffingtonpost.com/entry/university...,University Of Louisville Sorry A Bunch Of Its ...,SOCIAL_LIFE/DAILY
2,3732496,489,2015-11-01,US,Bleacher Report,http://bleacherreport.com/articles/2584708-kar...,'Kareem: Minority of One' HBO Documentary Prev...,POLITICS
3,3732501,840,2015-11-01,US,VentureBeat,http://venturebeat.com/2015/11/01/what-big-ind...,What big industry will do to the Internet of T...,HEALTH/MEDICAL
4,3732502,470,2015-11-01,US,Tech Insider,http://www.techinsider.io/daenerys-game-of-thr...,Daenerys has been traveling in the wrong direc...,SPORTS


In [6]:
# Final FactBook Data structure
all_facts_org.head(2)

Unnamed: 0,Country,Population,Age_structure0-14,Age_structure15-24,Age_structure25-54,Age_structure55-64,Age_structureover65,Median_age,Population_growth_rate,Birth_rate,...,Sex_ratio,Life_expectancy_at_birth,GDP_per_capita,Unemployment_rate,Inflation_rate,Electricity_renewable_sources,Carbon_dioxide_emissions,Internet_users,Religions,Ethnic_groups
0,United States,323995528,18.84,13.46,39.6,12.85,15.25,37.9,0.81,12.5,...,0.97,79.8,57300,4.7,1.3,7.4,5402.0,74.6,"Protestant 46.5%, Roman Catholic 20.8%, Mormon...","white 79.96%, black 12.85%, Asian 4.43%, Ameri..."
1,Ireland,4952473,21.51,11.8,43.52,10.33,12.84,36.4,1.2,14.5,...,1.0,80.8,69400,8.0,0.2,25.0,34.0,80.1,"Roman Catholic 84.7%, Church of Ireland 2.7%, ...","Irish 84.5%, other white 9.8%, Asian 1.9%, bla..."


## Now Corpus Additional Analysis and Plotting

In this part, we analyzed the Now Corpus data further, based on the new topics assigned to the articles, in order to answer the following research questions:

- What are the main topics of the published news? (tech, politics, sports, etc.)
- What are the distributions of articles over country and time?
- What are the distributions of these topics over country and time?
- What are the most frequently used words in each country's topics?


In [7]:
sources = full_data_sources
# change date to date time type
sources.date =  pd.to_datetime(sources.date, format='%Y-%m-%d')

### What are the main topics of the published news? 

We found 12 different topics overall in the news from the 20 English-speaking countries. We kept the topic names the same across countries to make comparison easier.

In [8]:
# main news topics we found from the now corpus data with LDA
display(sources.Topics.unique())
display(len(sources.Topics.unique()))

array(['SOCIAL_LIFE/DAILY', 'POLITICS', 'HEALTH/MEDICAL', 'SPORTS',
       'TECHNOLOGY/SCIENCE/SOCIAL MEDIA', 'POLICE/ACCIDENT/VIOLENCE',
       'LEGAL/LAW', 'ENVIRONMENT/ENERGY', 'ENTERTAINMENT/ART/MAGAZINE',
       'ECONOMY', 'COMPANY/BUSINESS', 'INTERNATIONAL'], dtype=object)

12

### What are the distributions of articles over country and time?

Since we decided to focus on the last year of news, we explored how each country's articles are distributed over the months. Because we already have all the articles, we grouped them by country and month and counted the article IDs (which are unique) to see the distribution of articles for each country over time.

In [9]:
# Time Article Count per Country per month
articles_per_country_month = sources.groupby(by=[sources.date.dt.month, 'country'])['textID'].count()
articles_per_country_month = pd.DataFrame(articles_per_country_month)
articles_per_country_month = articles_per_country_month.reset_index()
articles_per_country_month.rename(columns={'textID':'count'},inplace=True)
display(articles_per_country_month.T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,229,230,231,232,233,234,235,236,237,238
date,1,1,1,1,1,1,1,1,1,1,...,12,12,12,12,12,12,12,12,12,12
country,AU,BD,CA,GB,GH,HK,IE,IN,JM,KE,...,LK,MY,NG,NZ,PH,PK,SG,TZ,US,ZA
count,2268,767,3121,4108,1431,300,3065,4335,500,816,...,327,265,238,1378,1597,354,2282,170,4220,2364


In [10]:
articles_per_country_month.to_csv('../Data/articles_timeline.csv',index=False)
articles_per_country_month.T.to_csv('../Data/articles_timeline_transpose.csv')

In [11]:
country_list = articles_per_country_month.country.unique()

In [12]:
# ISO Codes Dictionary in order to be able to plot on the plotly world map
iso_codes_map = {'US':'USA', 'CA':'CAN', 'GB':'GBR', 'IE':'IRL',
                 'AU':'AUS', 'NZ':'NZL', 'IN':'IND', 'LK':'LKA',
                 'PK':'PAK', 'BD':'BGD', 'MY':'MYS', 'SG':'SGP',
                 'PH':'PHL', 'HK':'HKG', 'ZA':'ZAF', 'NG':'NGA',
                 'GH':'GHA', 'KE':'KEN', 'TZ':'TZA', 'JM':'JAM'}

We created a map with a month slider to better visualize how the total number of articles per country changes over time.
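
The idea behind such a slider is that every month gets its own trace and each slider step shows exactly one of them. A pure-Python sketch of that visibility logic, independent of Plotly, might look like this (`n_traces` is an assumption standing in for the 12 monthly traces):

```python
# One step per trace; each step carries a boolean mask with a single True.
n_traces = 12  # assumed: one trace per month

steps = []
for i in range(n_traces):
    visible = [False] * n_traces
    visible[i] = True  # only the i-th monthly trace is shown at this step
    steps.append({"method": "restyle", "label": str(i + 1), "args": ["visible", visible]})

# The first step shows only the first trace.
print(steps[0]["args"][1])
```

Building a fresh mask per step keeps the steps independent of each other, so changing one mask cannot accidentally alter another.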

In [13]:
title = 'Article Count per Country for each Month'
traces=[]

# Create the 12 month data (Article Count per Country for each month)
for i in range(1,13):
    one_month = articles_per_country_month[articles_per_country_month.date ==i]
    countries_month = []
    for c in one_month.country:
        countries_month.append(iso_codes_map[c])
    trace = {
        "name": "Article",
        "z": one_month['count'],    
        "colorbar": {
        "x": -0.1, 
        "y": 0.5,  
        "ticks": "inside"
      }, 
      "colorscale": [
        [0, "rgb(220, 220, 220)"], [0.2, "rgb(245, 195, 157)"], [0.4, "rgb(245, 160, 105)"], [1, "rgb(178, 10, 28)"]], 
      "locations":countries_month,
      "locationssrc": "gccg:56:ef6258", 
      "showscale": True, 
      "type": "choropleth", 
      "uid": "0f0f64", 
      "zauto": False, 
      "zmax": articles_per_country_month['count'].max(), 
      "zmin": articles_per_country_month['count'].min(), 
      "zsrc": "gccg:56:49c591"
    }
    traces.append(trace)
data = traces

# Creating steps for slider
steps = []
for i in range(len(data)):
    step = dict(
        method = 'restyle',
        label = i+1, 
        args = ['visible', [False] * len(data)],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    steps.append(step)

# Specifying the slider info
sliders = [dict(
    active = 1,
    currentvalue = {"prefix": "Month: "},
    pad = {"t": 12},
    steps = steps
)]

# Specifying the layout data of the plot
layout = {
  "sliders":sliders,
  "autosize": False, 
  "dragmode": "pan", 
  "geo": {
    "center": {
      "lat": 14.6663948865, 
      "lon": 108.63338266
    }, 
    "projection": {
      "rotation": {
        "lat": 15.1157137652, 
        "lon": 108.63338266
      }, 
      "scale": 0.972654947412, 
      "type": "equirectangular"
    }
  }, 
  "height": 500, 
  "showlegend": False, 
  "title": title, 
  "titlefont": {"size": 24}, 
  "width": 800, 
  "paper_bgcolor" : 'rgba(0,0,0,0)',
  "plot_bgcolor" :'rgba(0,0,0,0)',
   "margin" : {"r":10, "t":35},  
}

# Create the figure and plot
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='maps_article_nums')

From the above plot we can observe several things:
- Tanzania, Kenya and Ghana consistently have very few articles compared to other countries such as Canada, the USA, Great Britain and India. This also shows that our data is not equally distributed, neither within months nor overall. It is important to keep in mind that, for countries with limited data, the interpretations made here could be misleading with respect to the actual news media.
- According to our data, article counts are lower (paler colors) in the winter season in most countries.
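
Both observations can be made more concrete by computing monthly totals and each country's share of a month's articles. The sketch below is illustrative only: the January counts for five countries are copied from the table above, while the July counts are hypothetical placeholders.

```python
import pandas as pd

# January values come from the articles_per_country_month output;
# July values are hypothetical and only serve to have a second month.
counts = pd.DataFrame(
    {
        "date": [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
        "country": ["AU", "BD", "CA", "GB", "GH"] * 2,
        "count": [2268, 767, 3121, 4108, 1431, 2500, 900, 3300, 4500, 1500],
    }
)

# Share of each month's articles contributed by each country.
monthly_total = counts.groupby("date")["count"].transform("sum")
counts["share"] = counts["count"] / monthly_total

# Total number of articles per month, useful for spotting seasonal dips.
articles_per_month = counts.groupby("date")["count"].sum()
print(counts)
print(articles_per_month)
```

Shares make countries comparable within a month even when the overall volume differs between months.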

### What are the distributions of these topics over country and time?

The topic distribution for each country can be seen in the results below.

In [14]:
# Topics article count for each country for each month
topics_per_country_month = sources.groupby(by=[sources.date.dt.month, 'country','Topics'])['textID'].count()
topics_per_country_month =  pd.DataFrame(topics_per_country_month)
topics_per_country_month = topics_per_country_month.reset_index()
topics_per_country_month.rename(columns={'textID':'topics_count'},inplace=True)
display(topics_per_country_month.T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968
date,1,1,1,1,1,1,1,1,1,1,...,12,12,12,12,12,12,12,12,12,12
country,AU,AU,AU,AU,AU,AU,AU,AU,BD,BD,...,US,ZA,ZA,ZA,ZA,ZA,ZA,ZA,ZA,ZA
Topics,ECONOMY,ENTERTAINMENT/ART/MAGAZINE,HEALTH/MEDICAL,INTERNATIONAL,LEGAL/LAW,SOCIAL_LIFE/DAILY,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,ENVIRONMENT/ENERGY,INTERNATIONAL,...,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,ECONOMY,ENVIRONMENT/ENERGY,HEALTH/MEDICAL,INTERNATIONAL,LEGAL/LAW,POLITICS,SOCIAL_LIFE/DAILY,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA
topics_count,435,337,303,71,268,289,209,356,160,136,...,443,287,38,368,288,279,162,362,425,155


In [15]:
# Maximum topic per country per month
max_topics_country_month = topics_per_country_month.loc[topics_per_country_month.groupby(["date", "country"])["topics_count"].idxmax()]

In [16]:
title = 'Most Popular Topic per Country for each Month'
traces=[]

# Creating the data traces per month
for i in range(1,13):
    one_month = max_topics_country_month[max_topics_country_month.date ==i]
    countries_month = []
    for c in one_month.country:
        countries_month.append(iso_codes_map[c])
    trace = {
        "name":"Topics",
        "z": one_month['topics_count'],    
        "text" :one_month.Topics.astype(str),
        "colorbar": {
        "x": -0.1, 
        "y": 0.5,  
        "ticks": "inside"
      }, 
      "colorscale": [
        [0, "rgb(220, 220, 220)"], [0.2, "rgb(245, 195, 157)"], [0.4, "rgb(245, 160, 105)"], [1, "rgb(178, 10, 28)"]], 
      "locations":countries_month,
      "locationssrc": "gccg:56:ef6258", 
      "showscale": True, 
      "type": "choropleth", 
      "uid": "0f0f64", 
      "zauto": False, 
      "zmax": max_topics_country_month['topics_count'].max(), 
      "zmin": max_topics_country_month['topics_count'].min(), 
      "zsrc": "gccg:56:49c591"
    }
    traces.append(trace)
data = traces

# creating the steps for slider
steps = []
for i in range(len(data)):
    step = dict(
        method = 'restyle',
        label = i+1, 
        args = ['visible', [False] * len(data)],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    steps.append(step)

# Specifying the slider info
sliders = [dict(
    active = 1,
    currentvalue = {"prefix": "Month: "},
    pad = {"t": 12},
    steps = steps
)]

# Specifying the layout
layout = {
  "sliders":sliders,
  "autosize": False, 
  "dragmode": "pan", 
  "geo": {
    "center": {
      "lat": 14.6663948865, 
      "lon": 108.63338266
    }, 
    "projection": {
      "rotation": {
        "lat": 15.1157137652, 
        "lon": 108.63338266
      }, 
      "scale": 0.972654947412, 
      "type": "equirectangular"
    }
  }, 
  "height": 500, 
  "showlegend": False, 
  "title": title, 
  "titlefont": {"size": 24}, 
  "width": 800, 
  "paper_bgcolor" : 'rgba(0,0,0,0)',
  "plot_bgcolor" :'rgba(0,0,0,0)',
   "margin" : {"r":10, "t":35},  
}

# Plot the map
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='maps_topics_nums')

In the topic distribution map, we showed the most published topic in each country for each month.

From the topic distribution map for each country over time, we can see that:
- Overall, the most frequently published topic in each country doesn't change very often. This indicates which topics countries mostly talk about, based on the data we have. For example, the USA and India mostly talk about Politics, whereas in Australia the top topic varies between Economy, Tech and Entertainment/Art/Magazine: Economy dominates at the beginning of the year, while more Tech/Science articles are published in the second half of the year.

- For some countries, the most published topic doesn't change at all over time. For example, in South Africa Sports is the most published topic in every month of 2016.
Note: If the number of unique web sources collected for a country is limited, and those sources focus on a specific topic, the most frequent topic may appear not to change. This might be the case for South Africa, since it has a lower number of unique sources and these sources may publish only on specific topics.
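
The selection of the dominant topic per country and month, and the check of how often it changes, can be sketched on a toy frame. The counts below are hypothetical; only the column layout mirrors `topics_per_country_month`.

```python
import pandas as pd

# Hypothetical counts in the same layout as topics_per_country_month.
toy_counts = pd.DataFrame(
    {
        "date": [1, 1, 1, 1, 2, 2],
        "country": ["AU", "AU", "ZA", "ZA", "AU", "AU"],
        "Topics": ["ECONOMY", "SPORTS", "SPORTS", "POLITICS", "ECONOMY", "SPORTS"],
        "topics_count": [435, 209, 425, 162, 300, 310],
    }
)

# For every (month, country) group, idxmax gives the row label of the largest count;
# .loc then pulls those rows out of the original frame.
top_rows = toy_counts.loc[toy_counts.groupby(["date", "country"])["topics_count"].idxmax()]

# Number of distinct dominant topics per country over the period.
# A value of 1 means the dominant topic never changes.
distinct_top_topics = top_rows.groupby("country")["Topics"].nunique()
print(top_rows)
print(distinct_top_topics)
```

Counting distinct dominant topics per country gives a quick way to flag countries whose top topic never changes and whose sources may therefore deserve a closer look.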


### What are the most frequently used words in each country's topics?

We also checked the most frequently used words for each country's topic over the full one-year data.

In [17]:
most_freq_words = pd.read_csv('../Data/most_freq_words.csv')
most_freq_words 

Unnamed: 0,us,ie,ca,gb,gh,bd,hk,ng,pk,sg,tz,za,in,jm,ke,lk,my,nz,ph,au
0,electoral,klopp,fossil,wimbledon,entertainment,lawyer,insurance,stock,congress,merger,operator,teammate,puja,video,innovation,disaster,auto,copyright,firearm,nasa
1,romney,sunderland,tracking,milan,wear,sentence,marketing,trillion,erdogan,deutsche,online,fitness,shaka,perform,consumption,submit,circuit,prohibit,supt,integrate
2,objectionable,spur,hectare,diego,actress,supreme,partneship,regulatory,coup,malay,user,warrior,azad,dance,planning,flood,researcher,prosecutor,raid,wireless
3,pentagon,coleman,brook,clark,gospel,petition,secure,manufacturing,china-pakistan,civillian,stream,celtic,anti-nation,popular,profit,coal,tech,supreme,calamity,hybrid
4,libertarian,horgan,panther,spearhead,george,allegation,monetary,trader,census,religion,provider,winning,liberal,singer,machinery,allege,computer,suspend,isolated,audio
5,POLITICS,SPORTS,ENVIRONMENT/ENERGY,SPORTS,ENTERTAINMENT/ART/MAGAZINE,LEGAL/LAW,COMPANY/BUSINESS,ECONOMY,POLITICS,SOCIAL_LIFE/DAILY,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,SPORTS,POLITICS,ENTERTAINMENT/ART/MAGAZINE,COMPANY/BUSINESS,ENVIRONMENT/ENERGY,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,LEGAL/LAW,POLICE/ACCIDENT/VIOLENCE,TECHNOLOGY/SCIENCE/SOCIAL MEDIA


In [18]:
# Plot the Map for Most Frequent 5 Words
title = 'Most Frequent Words in Most Trending Topic per Country'

# Iterate over the full country list instead of relying on a loop variable from an earlier cell
countries_month = []
texts=[]
for c in country_list:
    countries_month.append(iso_codes_map[c])
    # Row 5 of most_freq_words holds the topic name; rows 0-4 hold its five frequent words
    texts.append('Topic: '+ most_freq_words[c.lower()][5]+'\n5 Frequent Words: \n'+'\n '.join(list(most_freq_words[c.lower()][:5])))


trace = {
    "name": 'Words',
    "z": np.ones(20),
    "text": texts,
    "colorbar": {
        "x": -0.1,
        "y": 0.5,
        "ticks": "inside"
    },
    "colorscale": 'Blues',
    "locations": countries_month,
    "locationssrc": "gccg:56:ef6258",
    "showscale": False,
    "type": "choropleth",
    "uid": "0f0f64",
    "zauto": False,
    # Color range: zmin has to be below zmax
    "zmin": 0,
    "zmax": 20000,
    "zsrc": "gccg:56:49c591"
}

data = [trace]

layout = {
  "autosize": False, 
  "dragmode": "pan", 
  "geo": {
    "center": {
      "lat": 14.6663948865, 
      "lon": 108.63338266
    }, 
    "projection": {
      "rotation": {
        "lat": 15.1157137652, 
        "lon": 108.63338266
      }, 
      "scale": 0.972654947412, 
      "type": "equirectangular"
    }
  }, 
  "height": 500, 
  "showlegend": False, 
  "title": title, 
  "titlefont": {"size": 24}, 
  "width": 800, 
  "paper_bgcolor" : 'rgba(0,0,0,0)',
  "plot_bgcolor" :'rgba(0,0,0,0)',
   "margin" : {"r":10, "t":35},  
}

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='maps_freq_words')

The map shows the most published news topic for each country over the full one-year data we used. For each country, we show the five most frequent and meaningful words that LDA found for that topic. These five words are what remains after excluding person names and less meaningful words, but each of them is within the 10-15 most frequent words for that topic.
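
A filtering step of this kind can be sketched with `collections.Counter`. The token list and the exclusion set below are hypothetical, and the helper `top_words` is an illustrative name rather than a function from the notebook.

```python
from collections import Counter

# Hypothetical tokens for one topic; in the notebook the words come from the LDA models.
tokens = ["election", "smith", "election", "senate", "smith",
          "vote", "senate", "election", "said"]

# Hypothetical exclusion set: person names and uninformative words.
excluded = {"smith", "said"}

def top_words(tokens, excluded, n=5):
    """Return the n most frequent tokens that are not in the exclusion set."""
    counts = Counter(t for t in tokens if t not in excluded)
    return [word for word, _ in counts.most_common(n)]

print(top_words(tokens, excluded, n=2))
```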

## Unique Source Distribution  ---  Data Collection Bias

We also examined the data bias mentioned earlier, which stems from the limited number of articles and sources. Since we also have each country's profile from the FactBook data, we checked whether the limited number of web sources per country is related to that country's Internet usage rate.

In [19]:
# How many websites per country? (count distinct websites)
websites_per_country = sources.groupby(by=['country'])['website'].nunique()
websites_per_country = pd.DataFrame(websites_per_country)
websites_per_country = websites_per_country.reset_index()

In [20]:
# Correlation 
merged_data = topic_distribution.merge(all_facts_org, left_on='country_name', right_on='Country')
merged_data.head(2)
data = merged_data.copy()
internet_users = data[['country','Internet_users']]
#internet_users

In [21]:
websites_userPercentage = pd.merge(websites_per_country, internet_users, left_on='country',right_on='country')

In [22]:
websites_userPercentage.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
country,AU,BD,CA,GB,GH,HK,IE,IN,JM,KE,LK,MY,NG,NZ,PH,PK,SG,TZ,US,ZA
website,762,21,988,2040,49,68,218,579,13,60,54,89,117,193,209,157,167,14,4155,279
Internet_users,84.6,14.4,88.5,92,23.5,85,80.1,26,43.2,45.6,30,71.1,47.4,88.2,40.7,18,82.1,5.4,74.6,51.9


In [23]:
# Internet Usage vs Number of Unique Websites Sources used per Country Plot
trace1 = go.Scatter(
    x=websites_userPercentage.country,
    y=websites_userPercentage.website,
    name='number of unique website',
     mode = 'markers'
)
trace2 = go.Scatter(
    x=websites_userPercentage.country,
    y=websites_userPercentage.Internet_users,
    name='internet usage',
     mode = 'markers',
    yaxis='y2'
)
data_chart = [trace1, trace2]
layout = go.Layout(
    title='Internet Usage vs Number of Unique Websites Sources Used per Country',
    xaxis=dict(
        title='Countries',
    ),
    yaxis=dict(
        title='Number of Unique Websites',
    ),
    yaxis2=dict(
        title='Internet Usage Percentage',
        titlefont=dict(
            color='rgb(148, 103, 189)'
        ),
        overlaying='y',
        side='right'
    ),
    paper_bgcolor = 'rgba(0,0,0,0)',
    plot_bgcolor = 'rgba(0,0,0,0)'
)

fig = go.Figure(data=data_chart, layout=layout)
py.iplot(fig, filename='multiple-internet-usage')


As can be seen, for some countries such as Jamaica, Bangladesh and Tanzania, both the number of unique websites and the internet usage are quite low. The former in particular introduces a large bias into the topic distribution of the country. For example, the dominant news topic in Tanzania appears to be TECHNOLOGY/SCIENCE/SOCIAL MEDIA, but this is because the websites from which the news was collected are mainly tech forums or blogs. Such biases exist for every country, although their effects are not as severe as in Tanzania. As the number of unique sources increases, the bias from individual sources decreases, since together they are likely to cover the various aspects of the news better.
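
One way to put a number on the relation between the two columns is a rank correlation. The sketch below is not part of the original analysis; it copies the values from the `websites_userPercentage` table and computes Spearman's rho as the Pearson correlation of the ranks.

```python
import pandas as pd

# Values copied from the websites_userPercentage table above.
countries = ["AU", "BD", "CA", "GB", "GH", "HK", "IE", "IN", "JM", "KE",
             "LK", "MY", "NG", "NZ", "PH", "PK", "SG", "TZ", "US", "ZA"]
websites = [762, 21, 988, 2040, 49, 68, 218, 579, 13, 60,
            54, 89, 117, 193, 209, 157, 167, 14, 4155, 279]
internet_users = [84.6, 14.4, 88.5, 92, 23.5, 85, 80.1, 26, 43.2, 45.6,
                  30, 71.1, 47.4, 88.2, 40.7, 18, 82.1, 5.4, 74.6, 51.9]

table = pd.DataFrame({"website": websites, "Internet_users": internet_users},
                     index=countries)

# Spearman's rho is the Pearson correlation of the ranks, so it only asks
# whether the two columns rise together, not whether the relation is linear.
rho = table["website"].rank().corr(table["Internet_users"].rank())
print(round(rho, 3))
```

With only 20 countries, a coefficient like this is at best a rough indication and says nothing about causation.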

## FactBook Additional Analysis and Plotting 

The detailed additional analysis of this part, done after the second milestone, was added to the FactBook Notebook. Here we only read the cleaned data and recreate the visuals for the website.

In [24]:
factBook = pd.read_csv('../Data/AllFacts.csv')

In [25]:
ys=['Population','Age_structure0-14','Age_structure15-24','Median_age',
    'Population_growth_rate','Birth_rate','Death_rate','Net_migration_rate',
   'Sex_ratio', 'Life_expectancy_at_birth','Unemployment_rate',
   'Inflation_rate','GDP_per_capita','Electricity_renewable_sources','Carbon_dioxide_emissions',
   'Internet_users']

In [55]:
from plotly import tools
traces = []
for i in range(16):
    trace1 = go.Bar(
        x=factBook['Country'],
        y=factBook[ys[i]],
        name=ys[i]
    )
    traces.append(trace1)
data = traces
fig = tools.make_subplots(rows=4, cols=4, shared_yaxes=True)

for r in range(0,4):
    for c in range(0,4):
        fig.append_trace(traces[r*4+c], r+1, c+1)
        
dics= dict( ticks='', showticklabels=False  )

fig['layout'].update(height=600, width=600, 
                     title='FactBook Plots', xaxis = dics, xaxis5 = dics,  xaxis9 = dics,  xaxis13 = dics, 
                                            xaxis2 = dics,  xaxis6 = dics,  xaxis10 = dics,  xaxis14 = dics,  
                                            xaxis3 = dics,  xaxis7 = dics,  xaxis11 = dics,  xaxis15 = dics,  
                                            xaxis4 = dics,  xaxis8 = dics,  xaxis12 = dics,  xaxis16 = dics,
                    paper_bgcolor = 'rgba(0,0,0,0)',
                    plot_bgcolor = 'rgba(0,0,0,0)',
                    margin = go.layout.Margin(
                        t=30,
                        l=30))

py.iplot(fig, filename='fact_plots')

This is the format of your plot grid:
[ (1,1) x1,y1 ]   [ (1,2) x2,y1 ]   [ (1,3) x3,y1 ]   [ (1,4) x4,y1 ] 
[ (2,1) x5,y2 ]   [ (2,2) x6,y2 ]   [ (2,3) x7,y2 ]   [ (2,4) x8,y2 ] 
[ (3,1) x9,y3 ]   [ (3,2) x10,y3 ]  [ (3,3) x11,y3 ]  [ (3,4) x12,y3 ]
[ (4,1) x13,y4 ]  [ (4,2) x14,y4 ]  [ (4,3) x15,y4 ]  [ (4,4) x16,y4 ]



In addition to the previous analysis, we added joint bar plots of the FactBook data to better compare the countries' profiles across the facts.

## Correlation Analysis

In the correlation analysis, we tried different approaches to find correlations between the two datasets.
One approach was to check whether assigning categorical values to some of the FactBook features improves the results. We also tried standardizing the data before the correlation analysis, and we tried two different correlation methods, Spearman and Pearson.

Based on our results, we decided not to use the categorical data, since it doesn't change our correlation results.
We also decided to use the Spearman correlation, since it captures monotonic increases and decreases and catches the trends in our data better than the Pearson correlation.
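
The difference between the two methods shows up on any relation that is monotonic but not linear. The following sketch uses synthetic data (not the FactBook values) and computes Spearman's rho as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but far from linearly.
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2)

pearson = x.corr(y)                 # measures linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. Spearman's rho

print(pearson, spearman)
```

Because the ranks of `x` and `y` coincide, Spearman's rho reaches its maximum here, while the Pearson coefficient is pulled down by the curvature.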

In [27]:
cols = all_facts_org.columns
all_facts_std = all_facts_org.copy()
all_facts_std[cols[1:-2]] = all_facts_std[cols[1:-2]].apply(lambda x: (x - np.mean(x)) / (np.std(x)))

In [28]:
all_facts_std.head(3)

Unnamed: 0,Country,Population,Age_structure0-14,Age_structure15-24,Age_structure25-54,Age_structure55-64,Age_structureover65,Median_age,Population_growth_rate,Birth_rate,...,Sex_ratio,Life_expectancy_at_birth,GDP_per_capita,Unemployment_rate,Inflation_rate,Electricity_renewable_sources,Carbon_dioxide_emissions,Internet_users,Religions,Ethnic_groups
0,United States,0.706372,-0.814154,-0.904887,0.070152,1.214698,1.153125,0.956782,-0.745739,-0.794542,...,-0.520573,0.755736,1.16856,-0.470022,-0.622597,0.143751,4.087505,0.710816,"Protestant 46.5%, Roman Catholic 20.8%, Mormon...","white 79.96%, black 12.85%, Asian 4.43%, Ameri..."
1,Ireland,-0.4612,-0.533846,-1.404259,0.900961,0.507516,0.697732,0.773139,-0.132803,-0.549124,...,0.189299,0.873636,1.640903,-0.127622,-0.859982,2.765532,-0.403794,0.906437,"Roman Catholic 84.7%, Church of Ireland 2.7%, ...","Irish 84.5%, other white 9.8%, Asian 1.9%, bla..."
2,Australia,-0.39518,-0.919139,-1.0553,0.483437,0.925651,1.260831,1.042482,-0.368547,-0.843626,...,0.425924,1.038695,0.836749,-0.355889,-0.601016,0.173544,-0.11012,1.066491,"Protestant 30.1% (Anglican 17.1%, Uniting Chur...","English 25.9%, Australian 25.4%, Irish 7.5%, S..."


In [29]:
# Describe facts data
all_facts_std.Inflation_rate.describe()

count    2.000000e+01
mean     1.401657e-16
std      1.025978e+00
min     -1.075787e+00
25%     -6.064113e-01
50%     -3.528407e-01
75%      3.053637e-01
max      2.938181e+00
Name: Inflation_rate, dtype: float64
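
The reported `std` of about 1.026, rather than exactly 1, is consistent with the two functions using different conventions: `np.std` divides by n, while `describe()` reports the sample standard deviation, which divides by n - 1. For n = 20 the ratio is sqrt(20/19) ≈ 1.025978. A minimal sketch with synthetic values:

```python
import numpy as np
import pandas as pd

# Any 20 distinct values will do; these are synthetic.
values = pd.Series(np.linspace(-0.8, 17.8, 20))

# Standardize as in the cell above: np.std divides by n (ddof=0).
z = (values - np.mean(values)) / np.std(values)

population_std = z.std(ddof=0)  # what the standardization normalizes to 1
sample_std = z.std(ddof=1)      # what describe() reports

print(population_std, sample_std, np.sqrt(20 / 19))
```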

In [30]:
all_facts_cat= all_facts_std.copy()

In [31]:
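# Convert the facts into three categorical levels: low (1), medium (2) and high (3).
# Since this did not change the correlation results, it is not used in the final analysis.

labels = [1,2,3]
# Initial subset of columns considered for binning (overridden just below)
cols_to_cat=['Carbon_dioxide_emissions','Electricity_renewable_sources','GDP_per_capita',
             'Unemployment_rate','Inflation_rate','Life_expectancy_at_birth','Internet_users']
# Bin every numeric column (everything except Country, Religions and Ethnic_groups)
cols_to_cat = all_facts_org.columns[1:-2]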
for col in cols_to_cat:
    all_facts_cat[col] = pd.cut(all_facts_org[col],len(labels), labels=labels)
all_facts_cat

Unnamed: 0,Country,Population,Age_structure0-14,Age_structure15-24,Age_structure25-54,Age_structure55-64,Age_structureover65,Median_age,Population_growth_rate,Birth_rate,...,Sex_ratio,Life_expectancy_at_birth,GDP_per_capita,Unemployment_rate,Inflation_rate,Electricity_renewable_sources,Carbon_dioxide_emissions,Internet_users,Religions,Ethnic_groups
0,United States,1,1,1,2,3,3,3,1,1,...,2,3,2,1,1,1,3,3,"Protestant 46.5%, Roman Catholic 20.8%, Mormon...","white 79.96%, black 12.85%, Asian 4.43%, Ameri..."
1,Ireland,1,1,1,2,2,2,3,2,1,...,2,3,3,1,1,3,1,3,"Roman Catholic 84.7%, Church of Ireland 2.7%, ...","Irish 84.5%, other white 9.8%, Asian 1.9%, bla..."
2,Australia,1,1,1,2,3,3,3,1,1,...,2,3,2,1,1,1,1,3,"Protestant 30.1% (Anglican 17.1%, Uniting Chur...","English 25.9%, Australian 25.4%, Irish 7.5%, S..."
3,United Kingdom,1,1,1,2,3,3,3,1,1,...,2,3,2,1,1,2,1,3,"Christian (includes Anglican, Roman Catholic, ...","white 87.2%, black/African/Caribbean/black Bri..."
4,Canada,1,1,1,2,3,3,3,1,1,...,2,3,2,1,1,1,1,3,"Catholic 39% (includes Roman Catholic 38.8%, o...","Canadian 32.2%, English 19.8%, French 15.5%, S..."
5,India,3,2,3,2,1,1,2,2,2,...,3,2,1,1,2,2,2,1,"Hindu 79.8%, Muslim 14.2%, Christian 2.3%, Sik...","Indo-Aryan 72%, Dravidian 25%, Mongoloid and o..."
6,New Zealand,1,1,1,2,3,3,3,1,1,...,2,3,2,1,1,2,1,3,"Christian 44.3% (Catholic 11.6%, Anglican 10.8...","European 71.2%, Maori 14.1%, Asian 11.3%, Paci..."
7,South Africa,1,2,3,2,1,1,2,1,2,...,2,1,1,3,2,1,1,2,"Protestant 36.6% (Zionist Christian 11.1%, Pen...","black African 80.2%, white 8.4%, colored 8.8%,..."
8,Sri Lanka,1,2,2,2,2,2,2,1,1,...,2,3,1,1,1,1,1,1,"Buddhist (official) 70.2%, Hindu 12.6%, Muslim...","Sinhalese 74.9%, Sri Lankan Tamil 11.2%, Sri L..."
9,Singapore,1,1,2,3,2,2,2,2,1,...,2,3,3,1,1,1,1,3,"Buddhist 33.9%, Muslim 14.3%, Taoist 11.3%, Ca...","Chinese 74.2%, Malay 13.3%, Indian 9.2%, other..."
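
`pd.cut` with an integer number of bins creates equal-width bins, so a single extreme value can push almost every other country into the lowest level, as seems to happen for `Carbon_dioxide_emissions` in the table above. The sketch below contrasts this with quantile bins from `pd.qcut`, an alternative that the notebook does not use. The series is illustrative: 10 and 5402 are the minimum and maximum of that column, and the middle values are hypothetical.

```python
import pandas as pd

# Strongly skewed values with one extreme outlier (middle values are hypothetical).
emissions = pd.Series([10, 30, 90, 400, 5402])
labels = [1, 2, 3]

# Equal-width bins, as in the cell above: the outlier stretches the range,
# so almost everything lands in the lowest bin.
equal_width = pd.cut(emissions, len(labels), labels=labels)

# Equal-frequency (quantile) bins, shown only for comparison.
equal_freq = pd.qcut(emissions, len(labels), labels=labels)

print(equal_width.tolist())
print(equal_freq.tolist())
```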


In [32]:
#all_facts = all_facts_cat
all_facts = all_facts_org

In [33]:
all_facts.loc[all_facts['Country']=='Tanzania','Unemployment_rate']=None

In [34]:
TOPIC_LIST = ['ENVIRONMENT/ENERGY','INTERNATIONAL','POLITICS', \
              'SPORTS', 'TECHNOLOGY/SCIENCE/SOCIAL MEDIA', 'SOCIAL_LIFE/DAILY', \
              'ENTERTAINMENT/ART/MAGAZINE', 'COMPANY/BUSINESS', 'ECONOMY', \
              'POLICE/ACCIDENT/VIOLENCE', 'LEGAL/LAW','HEALTH/MEDICAL']

In [35]:
# Correlation 
merged_data = topic_distribution.merge(all_facts, left_on='country_name', right_on='Country')
merged_data.head(2)
data = merged_data.copy()

In [36]:
fact_columns = []
for i in data.columns:
    if(i not in TOPIC_LIST and i not in ['country','country_name','Country']):
        fact_columns.append(i)  

We created a pie chart of the topic distribution for each country to compare the distributions visually.

In [37]:
# Pie Charts for Each Country based on topic distributions
x1s= [0.01, 0.26, 0.51, 0.76]
x2s= [0.25, 0.5, 0.75, 1]
y1s= [0.81, 0.61, 0.41, 0.21, 0.01]
y2s= [1, 0.8, 0.6, 0.4, 0.2]
figs=[]
annots= []
for i in range(5):

    y1= y1s[i]
    y2= y2s[i]
    for j in range(4):
        country = topic_distribution.country.values[i*4+j]
        val =topic_distribution[topic_distribution['country']==country].values
        x1= x1s[j]
        x2= x2s[j]
        figs_data= {
              "values": val[0][2:],
              "labels": TOPIC_LIST,
              "domain": {"x": [x1, x2], "y":[y1,y2]},
              "name": country,
              "hoverinfo":"label+percent+name",
              "hole": .4,
              "type": "pie",
              "textposition": "inside"
            }
        figs.append(figs_data)
        an={    
            "font": {
                "size": 20
            },
            "showarrow": False,
            "text":'',# country,
            "x": x1+0.095,
            "y": y1+0.13    
        }
        annots.append(an)

fig = {
  "data": figs,
  "layout": {
        #"title":"Global Emissions 1990-2011",
        "annotations": annots,
            "paper_bgcolor" : 'rgba(0,0,0,0)',
            "plot_bgcolor" :'rgba(0,0,0,0)',
            "margin" : {"r":10, "t":30},         
    }
}
py.iplot(fig, filename='donut_all_code')

From the above pie charts, we can make some notable observations:
- While some countries' topics are more evenly distributed, others seem to focus on a few specific topics. For example, news from the Philippines, Great Britain, Canada and the US seems to be slightly more evenly distributed than news from countries like Pakistan, Sri Lanka and India.
- Health/Medical seems to have a lower share across countries compared to other topics.
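
How evenly a country's topics are distributed could also be quantified, for example with a normalized Shannon entropy. This measure is our suggestion and is not used in the notebook; the helper name `normalized_entropy` and the `concentrated` example are hypothetical, while `us_row` is the United States row of the topic distribution table shown earlier.

```python
import numpy as np

def normalized_entropy(shares):
    """Shannon entropy of a topic distribution, scaled to [0, 1].

    NaN entries (topics not found for a country) are ignored; 1 means the
    observed topics are covered perfectly evenly.
    """
    p = np.asarray(shares, dtype=float)
    p = p[~np.isnan(p)]
    n = len(p)
    if n < 2:
        return 0.0
    p = p / p.sum()   # renormalize in case the shares do not sum to 1
    nz = p[p > 0]     # skip zero shares to avoid log(0)
    return float(-(nz * np.log(nz)).sum() / np.log(n))

# United States row of the topic distribution table (INTERNATIONAL is NaN).
us_row = [0.022056, np.nan, 0.194978, 0.113442, 0.082864, 0.13251,
          0.055243, 0.084825, 0.084809, 0.084572, 0.056618, 0.088082]
# Hypothetical country dominated by a single topic.
concentrated = [0.70, 0.10, 0.10, 0.10]

print(normalized_entropy(us_row), normalized_entropy(concentrated))
```

Note that the score is normalized by the number of topics actually observed for a country, so countries with many NaN topics are compared on their observed topics only.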

In [38]:
# Topic Distribution per country
traces=[]
for topic in TOPIC_LIST:
    trace1 = go.Bar(
        x=topic_distribution.country_name,
        y=topic_distribution[topic],
        name=topic
    )
    traces.append(trace1)
layout = go.Layout(
    barmode='stack',
    paper_bgcolor = 'rgba(0,0,0,0)',
    plot_bgcolor = 'rgba(0,0,0,0)',
    title = 'Topic Distributions',
    margin = go.layout.Margin(
        t=30,
        l=30
    )   
)
fig = go.Figure(data=traces, layout=layout)
py.iplot(fig, filename='topic_distribution.html',colorscale='Light24')

In [39]:
data.describe()

Unnamed: 0,ENVIRONMENT/ENERGY,INTERNATIONAL,POLITICS,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,SOCIAL_LIFE/DAILY,ENTERTAINMENT/ART/MAGAZINE,COMPANY/BUSINESS,ECONOMY,POLICE/ACCIDENT/VIOLENCE,...,Death_rate,Net_migration_rate,Sex_ratio,Life_expectancy_at_birth,GDP_per_capita,Unemployment_rate,Inflation_rate,Electricity_renewable_sources,Carbon_dioxide_emissions,Internet_users
count,13.0,12.0,14.0,16.0,12.0,18.0,14.0,13.0,15.0,12.0,...,20.0,20.0,20.0,20.0,20.0,19.0,20.0,20.0,20.0,20.0
mean,0.103932,0.115086,0.140971,0.141445,0.125966,0.121846,0.132206,0.103798,0.115236,0.12929,...,7.25,1.145,0.992,73.39,27365.0,9.768421,4.185,6.435,516.615,54.615
std,0.07076,0.069292,0.096827,0.053675,0.075115,0.052363,0.054853,0.063879,0.051646,0.063986,...,1.910773,4.040124,0.043359,8.702141,26282.479155,9.853372,4.754198,6.887385,1226.249218,28.845966
min,0.014033,0.017842,0.033522,0.070796,0.028847,0.054292,0.042077,0.017421,0.020969,0.011228,...,3.5,-4.5,0.87,53.4,3100.0,2.1,-0.8,0.0,10.0,5.4
25%,0.027872,0.064097,0.073536,0.106116,0.076383,0.10482,0.100629,0.055618,0.07549,0.09066,...,6.35,-1.325,0.97,67.425,5700.0,4.8,1.375,0.625,29.5,29.0
50%,0.111274,0.114375,0.11434,0.123362,0.112088,0.117008,0.134671,0.084825,0.138273,0.15344,...,7.15,-0.2,0.99,74.3,12200.0,5.8,2.55,4.9,93.5,49.65
75%,0.141424,0.155383,0.175151,0.172863,0.167227,0.133094,0.151726,0.125539,0.15639,0.177225,...,7.9,2.85,1.01,80.9,46850.0,8.2,5.6,11.375,409.25,82.725
max,0.234718,0.256281,0.347152,0.252041,0.291896,0.298739,0.243091,0.224668,0.176033,0.211135,...,12.7,13.6,1.08,85.0,87100.0,40.0,17.8,25.0,5402.0,92.0


### FactBook Data and Now Corpus News Data Correlation Results

In [40]:
# create base data frame for correlation between FactBook data and News data
correlations = pd.DataFrame(fact_columns[:-2]) 
for col in TOPIC_LIST:
    correlations[col] = None
correlations.rename(columns={0:'facts'},inplace=True)
correlations.head()

Unnamed: 0,facts,ENVIRONMENT/ENERGY,INTERNATIONAL,POLITICS,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,SOCIAL_LIFE/DAILY,ENTERTAINMENT/ART/MAGAZINE,COMPANY/BUSINESS,ECONOMY,POLICE/ACCIDENT/VIOLENCE,LEGAL/LAW,HEALTH/MEDICAL
0,Population,,,,,,,,,,,,
1,Age_structure0-14,,,,,,,,,,,,
2,Age_structure15-24,,,,,,,,,,,,
3,Age_structure25-54,,,,,,,,,,,,
4,Age_structure55-64,,,,,,,,,,,,


For the correlation analysis between the datasets we ran several iterations, varying some of the factors mentioned above: turning some FactBook fields into categorical variables, using standardized and non-standardized versions of the data (to see whether extreme values affect the results too much), and using two different correlation methods, Spearman and Pearson. The results below come from our final choice, Spearman correlation.
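
To illustrate why the choice between Pearson and Spearman matters when extreme values are present, here is a minimal sketch on made-up numbers (the arrays `x` and `y` are hypothetical and are not taken from the FactBook or news data). Nine points follow a decreasing trend and one extreme point lies far away from the rest.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical toy data: a decreasing trend plus one extreme observation
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
y = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 50], dtype=float)

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)

# Pearson works on the raw values, so the single extreme point dominates
# and the coefficient comes out strongly positive.
print('Pearson :', round(pearson_r, 3))

# Spearman works on ranks, so the extreme point counts as just one rank
# and the decreasing trend of the other nine points still shows.
print('Spearman:', round(spearman_r, 3))
```

On this toy data the two coefficients even disagree in sign, which is the kind of sensitivity to extreme values that a rank-based method reduces.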

In [41]:
correlations_res = []
for topics in TOPIC_LIST:
    for fact in fact_columns:
        # Religions and Ethnic_groups are excluded from the correlation analysis
        if fact not in ['Religions', 'Ethnic_groups']:
            # keep only the countries that have values for both the topic and the fact
            dfx = data[[topics, fact]].dropna()
            corr, pvalue = spearmanr(dfx[topics], dfx[fact])
            # store the coefficient in the facts x topics table
            correlations.loc[correlations['facts'] == fact, [topics]] = corr
            # record pairs with |corr| > 0.25, marked by significance level
            if abs(corr) > 0.25:
                if pvalue < 0.05:
                    marker = '***'
                elif pvalue < 0.08:
                    marker = '*'
                else:
                    marker = ''
                correlations_res.append(marker + fact + ' - ' + topics + ' correlation: ' + str(corr) + ' pvalue: ' + str(pvalue))

We created the pairwise correlation heat map for all 12 topics and 21 facts. Before that, we calculated the Spearman correlations and saved the results whose correlation is higher than 0.25 in absolute value. For these selected correlations we also checked the p-values:
- '***' means a significant result according to the p-value (p < 0.05)
- '*' means a marginally significant result according to the p-value (0.05 <= p < 0.08)
- ' ' (no marker) means the correlation value is high but the p-value is not significant.
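
The marking scheme above can be sketched as a small standalone function. The name `significance_label` is hypothetical and does not appear in the notebook; the thresholds (0.25, 0.05 and 0.08) are the ones used in the loop above.

```python
def significance_label(corr, pvalue, min_abs_corr=0.25):
    """Return the marker for a correlation result (hypothetical helper).

    None  : |corr| is not above min_abs_corr, so the pair is not reported
    '***' : significant (p < 0.05)
    '*'   : marginally significant (0.05 <= p < 0.08)
    ''    : reported, but the p-value is not significant
    """
    if abs(corr) <= min_abs_corr:
        return None
    if pvalue < 0.05:
        return '***'
    if pvalue < 0.08:
        return '*'
    return ''

# Two pairs taken from the printed results of this notebook
print(repr(significance_label(0.5032967032967033, 0.06656042759911303)))    # Net_migration_rate - POLITICS
print(repr(significance_label(0.25274725274725274, 0.40477407127094656)))   # Age_structureover65 - ENVIRONMENT/ENERGY
```

Keeping the thresholds in one place like this would make it easier to rerun the analysis with a stricter cutoff.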

As seen below, we have a total of 8 correlations (counting both significant and marginally significant results).

In [42]:
correlations_res

['Age_structureover65 - ENVIRONMENT/ENERGY correlation: 0.25274725274725274 pvalue: 0.40477407127094656',
 'Population_growth_rate - ENVIRONMENT/ENERGY correlation: -0.2558461842645392 pvalue: 0.3988401854392244',
 'Population_growth_rate - INTERNATIONAL correlation: -0.4098079839779326 pvalue: 0.18582449084404484',
 'Death_rate - INTERNATIONAL correlation: -0.3047290137271806 pvalue: 0.3355078216764543',
 'Net_migration_rate - INTERNATIONAL correlation: -0.40280271929454914 pvalue: 0.19420166941805767',
 'Sex_ratio - INTERNATIONAL correlation: -0.2736858952707765 pvalue: 0.3893670040175019',
 'Population - POLITICS correlation: 0.45494505494505494 pvalue: 0.10215352669182914',
 'Age_structure15-24 - POLITICS correlation: -0.3186813186813187 pvalue: 0.2667849608113615',
 '*Net_migration_rate - POLITICS correlation: 0.5032967032967033 pvalue: 0.06656042759911303',
 'Sex_ratio - POLITICS correlation: 0.4022160834399561 pvalue: 0.15395395029976613',
 'Electricity_renewable_sources - POLIT

In [43]:
correlations_res =  pd.DataFrame(correlations_res)
correlations_res.to_csv('Correlation_Results.txt',index=False)

In [44]:
display(correlations)

Unnamed: 0,facts,ENVIRONMENT/ENERGY,INTERNATIONAL,POLITICS,SPORTS,TECHNOLOGY/SCIENCE/SOCIAL MEDIA,SOCIAL_LIFE/DAILY,ENTERTAINMENT/ART/MAGAZINE,COMPANY/BUSINESS,ECONOMY,POLICE/ACCIDENT/VIOLENCE,LEGAL/LAW,HEALTH/MEDICAL
0,Population,-0.0494505,0.167832,0.454945,-0.388235,-0.272727,0.170279,-0.569231,0.291209,0.189286,-0.00699301,-0.459341,-0.153846
1,Age_structure0-14,-0.0769231,-0.153846,-0.142857,-0.25,0.013986,0.0196078,0.0725275,-0.318681,0.325,-0.13986,-0.0197802,-0.0699301
2,Age_structure15-24,-0.247253,0.202797,-0.318681,-0.344118,-0.020979,0.467492,-0.0197802,-0.269231,0.164286,0.20979,-0.0857143,0.0559441
3,Age_structure25-54,0.032967,0.0979021,0.116484,0.352941,0.027972,-0.145511,0.134066,0.373626,-0.0785714,0.034965,0.0857143,-0.0559441
4,Age_structure55-64,0.247253,0.0909091,0.23956,0.235294,-0.0769231,-0.120743,-0.208791,0.247253,-0.15,0.013986,0.0813187,0.258741
5,Age_structureover65,0.252747,0.153846,0.120879,0.261765,-0.0699301,-0.147575,-0.0813187,0.186813,-0.185714,0.0,0.032967,0.328671
6,Median_age,0.175824,0.13986,0.231023,0.250184,-0.0629371,-0.181724,-0.142857,0.269231,-0.18588,-0.027972,0.107692,0.314685
7,Population_growth_rate,-0.255846,-0.409808,0.178022,-0.286976,0.507882,0.122934,0.107692,-0.352132,0.0964286,0.0560421,0.019802,-0.629371
8,Birth_rate,-0.0769231,-0.210158,-0.0637363,-0.211921,0.00699301,-0.0382034,0.0968097,-0.351648,0.321716,-0.202797,-0.0836084,-0.0875658
9,Death_rate,-0.10989,-0.304729,0.125275,0.102941,-0.276708,-0.31079,-0.178218,-0.032967,-0.00714924,-0.237762,-0.428571,0.503497


In [45]:
corr_result = correlations.drop(columns=['facts'])
res= corr_result.values

In [46]:
# Create Correlation Heatmap
trace = go.Heatmap(z=res,
                   x=TOPIC_LIST,
                   y=fact_columns[:-2])
layout_heat = go.Layout(
    barmode='stack',
    paper_bgcolor = 'rgba(0,0,0,0)',
    plot_bgcolor = 'rgba(0,0,0,0)',
    title = 'Pairwise Correlation Graph of Factbook Features and News Topics',
    margin = go.layout.Margin(
        t=30,
        l=30
    )   
)
fig = go.Figure(data=[trace], layout=layout_heat)
py.iplot(fig, filename='labelled-heatmap')

The heatmap above shows all of the correlation results; z is the correlation value.

### Checking the Significant Correlations with Scatter Plots

In [47]:
# Specific colors per country
countryColors = ["#d50000", "#c51162", "#aa00ff", "#6200ea", "#304ffe",
                 "#0091ea", "#00bfa5", "#00c853", "#64dd17", "#aeea00", 
                 "#ffd600", "#ffab00", "#ff6d00", "#dd2c00", "#3e2723", 
                 "#212121", "#546e7a", "#1b5e20", "#ce93d8","#f48fb1"]

In [48]:
# Found Significant Facts-Topics Pairs to Plot
significantFacts = ['Net_migration_rate','GDP_per_capita','Internet_users','Age_structure15-24',
                   'Population', 'Sex_ratio', 'Unemployment_rate','Population_growth_rate']
corres_topic = ['POLITICS', 'SPORTS', 'SPORTS', 'SOCIAL_LIFE/DAILY',
               'ENTERTAINMENT/ART/MAGAZINE', 'ECONOMY', 'LEGAL/LAW','HEALTH/MEDICAL']

In [49]:
# Scatter plots for Facts-Topics Correlations (Significant ones only)
for i in range(len(significantFacts)):
    trace = go.Scatter(
        x = data[significantFacts[i]],
        y = round(data[corres_topic[i]]*100,2),
        mode = 'markers',
        text = data.Country,
        textposition = 'bottom center',
        marker={
                "color": countryColors,
                 "size": 10
                }
    )
    layout = {
      "autosize": True, 
      "barmode": "stack", 
      "margin": {
        "t": 30, 
        "l": 30
      }, 
      "paper_bgcolor": "rgba(0, 0, 0, 0)", 
      "plot_bgcolor": "rgba(0, 0, 0, 0)", 
      "title": significantFacts[i].upper().replace("_", " ")+' vs '+corres_topic[i], 
      "xaxis": {
        "automargin": True, 
        "autorange": True, 
        "title": significantFacts[i].upper().replace("_", " ")
      }, 
      "yaxis": {
        "automargin": True, 
        "autorange": True, 
        "title":corres_topic[i]
      }
    }
    fig = go.Figure(data=[trace], layout=layout)
    cor_top = corres_topic[i].replace("/", "-")
    py.iplot(fig, filename=significantFacts[i]+'-vs-'+cor_top)


The resulting scatter plots are saved under the Plots folder and added to the website. You can see them here:
- [Population_growth_rate-vs-HEALTH-MEDICAL](../Plots/Population_growth_rate-vs-HEALTH-MEDICAL.html)
- [Net_migration_rate-vs-POLITICS](../Plots/Net_migration_rate-vs-POLITICS.html)
- [Population-vs-ENTERTAINMENT-ART-MAGAZINE](../Plots/Population-vs-ENTERTAINMENT-ART-MAGAZINE.html)
- [Sex_ratio-vs-ECONOMY](../Plots/Sex_ratio-vs-ECONOMY.html)
- [Age_structure15-24-vs-SOCIAL_LIFE-DAILY](../Plots/Age_structure15-24-vs-SOCIAL_LIFE-DAILY.html)
- [Internet_users-vs-SPORTS](../Plots/Internet_users-vs-SPORTS.html)
- [GDP_per_capita-vs-SPORTS](../Plots/GDP_per_capita-vs-SPORTS.html)
- [Unemployment_rate-vs-LEGAL-LAW](../Plots/Unemployment_rate-vs-LEGAL-LAW.html)       

Overall, the analysis of these two datasets suggests that there are some meaningful correlations between country profiles and the news content published. For example, countries with higher net migration rates tend to publish more on Politics, and countries with a younger population tend to publish more Social Life and Daily news. From the data we have, we can therefore conclude that news may reflect some truth behind the saying "A good newspaper is a nation talking to itself", since there are some significant trends between the topics published and the country profiles.

On the other hand, these correlations are based on limited data and may not exactly reflect the countries' behavior. Other limitations include the human interpretation involved in assigning topic names and the number of countries: with only 20, the data is not sufficient to support a general claim such as news becoming more globalized with changing technology.
