## Purpose

This notebook analyzes total articles per capita and high-quality articles per capita, by state and by census division. An article counts as high quality when it is rated "FA" (featured article) or "GA" (good article). We begin by importing the Python libraries and loading the table for analysis.

In [21]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

#Read in file for analysis
ranking_df = pd.read_csv('../clean_data/wp_scored_city_articles_by_state.csv')
ranking_df.head()

     state   regional_division  population        article_title  revision_id  article_quality
0  Alabama  East South Central     5074296   Abbeville, Alabama   1171163550                C
1  Alabama  East South Central     5074296  Adamsville, Alabama   1177621427                C
2  Alabama  East South Central     5074296     Addison, Alabama   1168359898                C
3  Alabama  East South Central     5074296       Akron, Alabama   1165909508               GA
4  Alabama  East South Central     5074296   Alabaster, Alabama   1179139816                C


First we will count the articles for each state and store the counts in their own dataframe.

In [22]:
#Counting articles grouped by state
count_by_state = pd.DataFrame({
    'count_articles': ranking_df.groupby(['state']).size()}).reset_index()

#Merging count into ranking_df
ranking_count_by_state = pd.merge(ranking_df, count_by_state, how = 'left',
                              on = 'state')

#Verifying merge is correct
if len(ranking_count_by_state) == len(ranking_df):
    print("Merge successful, {0} rows".format(len(ranking_df)))
else:
    print("ERROR in merge")

#Removing cols we don't need and deduping
tot_art_per_pop = ranking_count_by_state[['state','regional_division',
                                         'population',
                                          'count_articles']]
tot_art_per_pop = tot_art_per_pop.drop_duplicates().reset_index(drop = True)
tot_art_per_pop.head()

Merge successful, 21514 rows


        state   regional_division  population  count_articles
0     Alabama  East South Central     5074296             461
1      Alaska             Pacific      733583             149
2     Arizona            Mountain     7359197              91
3    Arkansas  West South Central     3045637             500
4  California             Pacific    39029342             482
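
The count, merge, and de-duplication steps above can also be collapsed into a single grouped aggregation. The following is a minimal sketch, assuming each state carries exactly one `regional_division` and one `population` value (as in the rows shown). The `toy_df` frame and the `per_state` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for ranking_df (illustrative rows, not the real dataset)
toy_df = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska'],
    'regional_division': ['East South Central', 'East South Central', 'Pacific'],
    'population': [5074296, 5074296, 733583],
    'article_title': ['Article A', 'Article B', 'Article C'],
})

# Group on the columns that are constant within a state and count the rows
per_state = (toy_df
             .groupby(['state', 'regional_division', 'population'])
             .size()
             .reset_index(name='count_articles'))
print(per_state)
```

This yields one row per state with the same four columns as `tot_art_per_pop`, without an intermediate merge to verify.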


Now we will calculate the number of articles per capita (article count divided by population) for each state.

In [23]:
#Creating total per population col
total_per_state = tot_art_per_pop.copy()
total_per_state['art_per_pop'] = (total_per_state['count_articles'] / total_per_state['population'])
total_per_state.head()

        state   regional_division  population  count_articles  art_per_pop
0     Alabama  East South Central     5074296             461     0.000091
1      Alaska             Pacific      733583             149     0.000203
2     Arizona            Mountain     7359197              91     0.000012
3    Arkansas  West South Central     3045637             500     0.000164
4  California             Pacific    39029342             482     0.000012
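
Per-capita values this small are hard to compare by eye. One option is to rescale them to articles per 100,000 residents. The sketch below uses the Alabama and Alaska figures from the table above; the `art_per_100k` column is a hypothetical addition, not something the original notebook computes or saves.

```python
import pandas as pd

# Two rows copied from the per-state table above
toy_states = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    'population': [5074296, 733583],
    'count_articles': [461, 149],
})

# Same ratio as in the notebook
toy_states['art_per_pop'] = toy_states['count_articles'] / toy_states['population']

# Hypothetical rescaled column: articles per 100,000 residents
toy_states['art_per_100k'] = (toy_states['art_per_pop'] * 100_000).round(1)
print(toy_states[['state', 'art_per_100k']])
```

The ranking is unchanged by the rescaling, since every value is multiplied by the same constant.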


Now we will print and save the top 10 US states by coverage.

In [24]:
#Getting top coverage
print("The top 10 states by article coverage are:")
top_10_coverage = total_per_state.sort_values('art_per_pop', ascending=False).reset_index(drop = True)[:10]
print(top_10_coverage)

#Saving top coverage
top_10_coverage.to_csv('../results/top_10_coverage.csv')

The top 10 states by article coverage are:
           state   regional_division  population  count_articles  art_per_pop
0        Vermont         New England      647064             329     0.000508
1   North Dakota  West North Central      779261             356     0.000457
2          Maine         New England     1385340             483     0.000349
3   South Dakota  West North Central      909824             311     0.000342
4           Iowa  West North Central     3200517            1042     0.000326
5         Alaska             Pacific      733583             149     0.000203
6   Pennsylvania     Middle Atlantic    12972008            2556     0.000197
7       Michigan  East North Central    10034113            1773     0.000177
8        Wyoming            Mountain      581381              99     0.000170
9  New Hampshire         New England     1395231             234     0.000168


Next we will print and save the bottom 10 US states by coverage.

In [25]:
#Getting bottom coverage
print("The bottom 10 states by article coverage are:")
bottom_10_coverage = total_per_state.sort_values('art_per_pop', ascending=True).reset_index(drop = True)[:10]
print(bottom_10_coverage)

#Saving bottom coverage
bottom_10_coverage.to_csv('../results/bottom_10_coverage.csv')

The bottom 10 states by article coverage are:
            state   regional_division  population  count_articles  art_per_pop
0  North Carolina      South Atlantic    10698973              50     0.000005
1          Nevada            Mountain     3177772              19     0.000006
2      California             Pacific    39029342             482     0.000012
3         Arizona            Mountain     7359197              91     0.000012
4        Virginia      South Atlantic     8683619             133     0.000015
5         Florida      South Atlantic    22244823             411     0.000018
6        Oklahoma  West South Central     4019800              75     0.000019
7          Kansas  West North Central     2937150              63     0.000021
8        Maryland      South Atlantic     6164660             157     0.000025
9       Wisconsin  East North Central     5892539             191     0.000032
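
The sort-then-slice pattern used for the top and bottom tables can also be written with `DataFrame.nlargest` and `DataFrame.nsmallest`. This is a sketch of an equivalent approach on a four-row toy frame (rounded rates taken from the outputs above), not a change to how the saved results were produced.

```python
import pandas as pd

# Rounded rates copied from the printed tables above
toy_rates = pd.DataFrame({
    'state': ['Vermont', 'Nevada', 'Alaska', 'California'],
    'art_per_pop': [0.000508, 0.000006, 0.000203, 0.000012],
})

# Highest and lowest rates without an explicit sort and slice
top_2 = toy_rates.nlargest(2, 'art_per_pop').reset_index(drop=True)
bottom_2 = toy_rates.nsmallest(2, 'art_per_pop').reset_index(drop=True)
print(top_2)
print(bottom_2)
```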


Now we will rank, print, and save the census divisions by article coverage.

In [32]:
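#Summing the populations and count of articles by division
#(numeric columns are selected explicitly so the string 'state' column is not summed)
tot_by_reg = (total_per_state
              .groupby('regional_division')[['population', 'count_articles']]
              .sum().reset_index())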

#Recalculating the new art_per_pop column
tot_by_reg['art_per_pop'] = (tot_by_reg['count_articles'] /
                             tot_by_reg['population'])

#Showing in descending order
tot_by_reg = tot_by_reg.sort_values('art_per_pop', ascending=False).reset_index(drop = True)

#Printing results
print("The Census Divisions in descending order of articles per population are:")
print(tot_by_reg)

#Saving table
tot_by_reg.to_csv('../results/census_reg_by_tot_coverage.csv')

The Census Divisions in descending order of articles per population are:
    regional_division  population  count_articles  art_per_pop
0  West North Central    19721893            3577     0.000181
1         New England    11503343            1437     0.000125
2  East North Central    47097779            4753     0.000101
3     Middle Atlantic    41910858            3780     0.000090
4  East South Central    19578002            1528     0.000078
5  West South Central    41685250            2099     0.000050
6            Mountain    25514320            1187     0.000047
7      South Atlantic    66781137            1849     0.000028
8             Pacific    53229044            1304     0.000024
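
The division rate is recalculated from the summed counts and populations rather than by averaging the state-level rates, because a plain average would weight a small state as heavily as a large one. The sketch below illustrates the difference using only two Pacific states from the tables above, so the numbers are not the full division figures; `pooled_rate` and `mean_of_rates` are illustrative names.

```python
import pandas as pd

# Alaska and California rows copied from the per-state table above
pacific_sample = pd.DataFrame({
    'state': ['Alaska', 'California'],
    'regional_division': ['Pacific', 'Pacific'],
    'population': [733583, 39029342],
    'count_articles': [149, 482],
})

# Pooled rate: sum the counts and populations first, then divide
sums = (pacific_sample
        .groupby('regional_division')[['population', 'count_articles']]
        .sum())
pooled_rate = (sums['count_articles'] / sums['population']).iloc[0]

# Unweighted mean of the two state rates, shown for comparison only
state_rates = pacific_sample['count_articles'] / pacific_sample['population']
mean_of_rates = state_rates.mean()

print(pooled_rate, mean_of_rates)
```

In this sample the unweighted mean is several times larger than the pooled rate, because Alaska's high rate counts as much as California's despite its far smaller population.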


Next we will analyze high-quality articles per capita. We consider an article high quality if ORES rates it FA or GA. We begin by filtering the full dataset down to the high-quality articles only.

In [27]:
#Limiting the dataset by ranking
quality_df = ranking_df.loc[ranking_df['article_quality'].isin(['GA','FA'])]

Then we will count the high-quality articles by state and calculate the high-quality articles per capita for each state.

In [28]:
#Counting high-quality articles grouped by state
good_count_by_state = pd.DataFrame({
    'count_articles': quality_df.groupby(['state']).size()}).reset_index()

#Merging count into quality_df
ranking_good_count_by_state = pd.merge(quality_df, good_count_by_state, 
                                       how = 'left', on = 'state')

#Verifying merge is correct
if len(ranking_good_count_by_state) == len(quality_df):
    print("Merge successful, {0} rows".format(len(quality_df)))
else:
    print("ERROR in merge")

#Removing cols we don't need and deduping
good_art_per_pop = ranking_good_count_by_state[['state','regional_division',
                                         'population',
                                          'count_articles']]
good_art_per_pop = good_art_per_pop.drop_duplicates().reset_index(drop = True)

#Creating total per population col
total_good_per_state = good_art_per_pop.copy()
total_good_per_state['art_per_pop'] = (total_good_per_state['count_articles'] 
                                       / total_good_per_state['population'])
total_good_per_state.head()

Merge successful, 4928 rows


        state   regional_division  population  count_articles  art_per_pop
0     Alabama  East South Central     5074296              53     0.000010
1      Alaska             Pacific      733583              31     0.000042
2     Arizona            Mountain     7359197              24     0.000003
3    Arkansas  West South Central     3045637              72     0.000024
4  California             Pacific    39029342             173     0.000004
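
Because the counts are computed on the filtered `quality_df`, a state with no FA or GA articles would not appear in `total_good_per_state` at all, rather than appearing with a count of 0. Whether any state in the actual data is affected is not shown here. The sketch below, on a toy frame with illustrative names, shows one way to keep such states by reindexing against the full list of states.

```python
import pandas as pd

# Toy stand-in for ranking_df: Alaska has no high-quality article here
toy_df = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska'],
    'article_quality': ['GA', 'C', 'C'],
})

# Every state present before filtering
all_states = toy_df['state'].drop_duplicates()

# Count high-quality articles, then restore missing states with a count of 0
good_counts = (toy_df[toy_df['article_quality'].isin(['GA', 'FA'])]
               .groupby('state').size()
               .reindex(all_states, fill_value=0)
               .rename_axis('state')
               .rename('count_articles')
               .reset_index())
print(good_counts)
```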


Now we will print and save the top 10 US states by high-quality articles per capita.

In [29]:
#Getting top quality
print("The top 10 states by article quality are:")
top_10_quality = total_good_per_state.sort_values('art_per_pop', ascending=False).reset_index(drop = True)[:10]
print(top_10_quality)

#Saving top quality
top_10_quality.to_csv('../results/top_10_quality.csv')

The top 10 states by article quality are:
           state   regional_division  population  count_articles  art_per_pop
0        Vermont         New England      647064              45     0.000070
1        Wyoming            Mountain      581381              39     0.000067
2   South Dakota  West North Central      909824              56     0.000062
3  West Virginia      South Atlantic     1775156             105     0.000059
4        Montana            Mountain     1122867              55     0.000049
5  New Hampshire         New England     1395231              63     0.000045
6   Pennsylvania     Middle Atlantic    12972008             566     0.000044
7       Missouri  West North Central     6177957             263     0.000043
8         Alaska             Pacific      733583              31     0.000042
9     New Jersey     Middle Atlantic     9261699             379     0.000041


Next we will print and save the bottom 10 US states by high-quality articles per capita.

In [30]:
#Getting bottom quality
print("The bottom 10 states by article quality are:")
bottom_10_quality = total_good_per_state.sort_values('art_per_pop', ascending=True).reset_index(drop = True)[:10]
print(bottom_10_quality)

#Saving bottom quality
bottom_10_quality.to_csv('../results/bottom_10_quality.csv')

The bottom 10 states by article quality are:
            state   regional_division  population  count_articles  art_per_pop
0  North Carolina      South Atlantic    10698973              20     0.000002
1        Virginia      South Atlantic     8683619              18     0.000002
2          Nevada            Mountain     3177772               8     0.000003
3         Arizona            Mountain     7359197              24     0.000003
4      California             Pacific    39029342             173     0.000004
5         Florida      South Atlantic    22244823             118     0.000005
6        New York     Middle Atlantic    19677151             111     0.000006
7        Maryland      South Atlantic     6164660              42     0.000007
8          Kansas  West North Central     2937150              22     0.000007
9        Oklahoma  West South Central     4019800              31     0.000008


Now we will rank, print, and save the census divisions by article quality.

In [33]:
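#Summing the populations and count of high-quality articles by division
#(numeric columns are selected explicitly so the string 'state' column is not summed)
tot_good_by_reg = (total_good_per_state
                   .groupby('regional_division')[['population', 'count_articles']]
                   .sum().reset_index())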

#Recalculating the new art_per_pop column
tot_good_by_reg['art_per_pop'] = (tot_good_by_reg['count_articles'] /
                             tot_good_by_reg['population'])

#Showing in descending order
tot_good_by_reg = tot_good_by_reg.sort_values('art_per_pop', ascending=False).reset_index(drop = True)

#Printing results
print("The Census Divisions in descending order of quality articles per population are:")
print(tot_good_by_reg)

#Saving table
tot_good_by_reg.to_csv('../results/census_reg_by_quality_coverage.csv')

The Census Divisions in descending order of quality articles per population are:
    regional_division  population  count_articles  art_per_pop
0  West North Central    19721893             637     0.000032
1     Middle Atlantic    41910858            1056     0.000025
2         New England    11503343             225     0.000020
3  East South Central    19578002             316     0.000016
4  West South Central    41685250             633     0.000015
5  East North Central    47097779             712     0.000015
6            Mountain    25514320             335     0.000013
7             Pacific    53229044             490     0.000009
8      South Atlantic    66781137             524     0.000008
