# Analysis Notebook
In this notebook I perform several data transformations to generate six analytical summary tables from our dataset. First, let's load our final data table.

In [34]:
import pandas as pd

In [35]:
city_df = pd.read_csv("../data_final/wp_scored_city_articles_by_state.csv")
city_df.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5049846,"Abbeville, Alabama",1171164000.0,C
1,Alabama,East South Central,5049846,"Adamsville, Alabama",1177621000.0,C
2,Alabama,East South Central,5049846,"Addison, Alabama",1168360000.0,C
3,Alabama,East South Central,5049846,"Akron, Alabama",1165910000.0,GA
4,Alabama,East South Central,5049846,"Alabaster, Alabama",1179140000.0,C


In [36]:
def get_articles_per_capita(df):
    # Grouping by state and population and counting the number of articles in each state.
    articles_per_capita = df[['state', 'population', 'article_title']].groupby(['state', 'population']).count()
    # Resetting columns due to hierarchy
    articles_per_capita = articles_per_capita.reset_index()
    # Renaming columns
    articles_per_capita.columns = ['state', 'population', 'num_articles']
    # Computing articles per capita
    articles_per_capita['articles_per_capita'] = articles_per_capita['num_articles'] / articles_per_capita['population']
    return articles_per_capita
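
To make the steps in `get_articles_per_capita` concrete, here is a minimal sketch on a toy dataframe. The toy values and the names `toy_df` and `toy_per_capita` are made up for illustration and are not part of the real dataset. It uses pandas named aggregation as one possible alternative to renaming the columns afterwards, and it should give the same three columns plus the ratio.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset).
# As in city_df, the state population is repeated on every article row.
toy_df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "population": [1000, 1000, 1000, 500, 500],
    "article_title": ["a1", "a2", "a3", "b1", "b2"],
})

# Group by state and population, counting article titles.
# Named aggregation gives the count column its final name directly.
toy_per_capita = (
    toy_df.groupby(["state", "population"], as_index=False)
          .agg(num_articles=("article_title", "count"))
)

# Articles per capita: state A is 3 / 1000, state B is 2 / 500.
toy_per_capita["articles_per_capita"] = (
    toy_per_capita["num_articles"] / toy_per_capita["population"]
)
print(toy_per_capita)
```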

1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).

In [37]:
top_ten_states_coverage = get_articles_per_capita(city_df)
top_ten_states_coverage.sort_values('articles_per_capita', ascending=False).head(10)


Unnamed: 0,state,population,num_articles,articles_per_capita
44,Vermont,646972,329,0.000509
33,North Dakota,777934,356,0.000458
18,Maine,1377238,483,0.000351
40,South Dakota,896164,311,0.000347
14,Iowa,3197689,1043,0.000326
1,Alaska,734182,149,0.000203
37,Pennsylvania,13012059,2556,0.000196
21,Michigan,10037504,1773,0.000177
49,Wyoming,579483,99,0.000171
28,New Hampshire,1387505,234,0.000169


2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).


To show the ten states with the lowest articles per capita, we reuse the previous result, flip the sort to ascending order, and take the first ten rows.

In [38]:
top_ten_states_coverage.sort_values('articles_per_capita', ascending=True).head(10)

Unnamed: 0,state,population,num_articles,articles_per_capita
26,Nebraska,1963554,0,0.0
6,Connecticut,3623355,0,0.0
32,North Carolina,10565885,50,5e-06
27,Nevada,3146402,19,6e-06
4,California,39142991,482,1.2e-05
2,Arizona,7264877,91,1.3e-05
45,Virginia,8657365,133,1.5e-05
35,Oklahoma,3991225,75,1.9e-05
8,Florida,21828069,412,1.9e-05
15,Kansas,2937922,63,2.1e-05


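Here we see that Nebraska and Connecticut have the lowest articles per capita, with 0 articles each. Since `count()` only counts non-null values, a count of 0 means these two states have rows in the dataset but no article titles in them.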
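
A count of 0 can only come out of a `groupby` when the group has rows whose counted column is missing. The sketch below illustrates this on toy data; the values and the names `toy_df`, `counts`, and `sizes` are hypothetical. It contrasts `count()`, which skips missing values, with `size()`, which counts rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): state B has a row but no article title.
toy_df = pd.DataFrame({
    "state": ["A", "A", "B"],
    "population": [1000, 1000, 500],
    "article_title": ["a1", "a2", np.nan],
})

grouped = toy_df.groupby(["state", "population"])

# count() ignores missing titles, so state B gets 0.
counts = grouped["article_title"].count()

# size() counts rows regardless of missing values, so state B gets 1.
sizes = grouped.size()

print(counts)
print(sizes)
```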

3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).


Now we introduce another filter on our dataset: we only want the highest quality articles. Following the homework specification, we define high quality articles as those that ORES classified as a "featured article" (FA) or a "good article" (GA). To filter the original dataset, we pass a boolean mask to our dataframe.

In [39]:
# Filtering only for high quality articles (FA or GA).
high_quality = city_df[city_df['article_quality'].isin(['FA', 'GA'])]

# Retrieving articles per capita of high_quality articles
high_quality_per_capita = get_articles_per_capita(high_quality)

# Sorting in descending order
high_quality_per_capita.sort_values("articles_per_capita", ascending=False).head(10)



Unnamed: 0,state,population,num_articles,articles_per_capita
42,Vermont,646972,45,7e-05
47,Wyoming,579483,39,6.7e-05
38,South Dakota,896164,56,6.2e-05
45,West Virginia,1785526,105,5.9e-05
24,Montana,1106227,55,5e-05
26,New Hampshire,1387505,63,4.5e-05
35,Pennsylvania,13012059,566,4.3e-05
23,Missouri,6169823,263,4.3e-05
1,Alaska,734182,31,4.2e-05
27,New Jersey,9267961,379,4.1e-05
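
One consequence of filtering before grouping is that a state with no FA or GA articles has no rows left, so it drops out of the result instead of appearing with a count of 0. The sketch below shows one way to keep such states, assuming we want them listed with 0 high quality articles. The toy values and the names `toy_df`, `state_pop`, `hq_counts_all`, and `hq_per_capita` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): state B has no FA or GA article.
toy_df = pd.DataFrame({
    "state": ["A", "A", "B", "C"],
    "population": [1000, 1000, 500, 800],
    "article_quality": ["GA", "C", "Stub", "FA"],
})

# Filter first, then count per state: state B is missing from the result.
high = toy_df[toy_df["article_quality"].isin(["FA", "GA"])]
hq_counts = high.groupby("state").size()

# One population value per state, taken from the unfiltered data.
state_pop = toy_df.drop_duplicates("state").set_index("state")["population"]

# Reindex against the full list of states, filling absent states with 0.
hq_counts_all = hq_counts.reindex(state_pop.index, fill_value=0)
hq_per_capita = hq_counts_all / state_pop

print(hq_per_capita.sort_values())
```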


4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).


In [40]:
# Sorting in ascending order to list the lowest values first
high_quality_per_capita.sort_values('articles_per_capita', ascending=True).head(10)

States with no FA or GA articles have no rows in `high_quality`, so they are absent from `high_quality_per_capita` and cannot appear in this listing. It ranks only states with at least one high quality article.


5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.


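This table needs a slightly more complex aggregation. We now group by the `regional_division` column, counting the articles while summing the populations. Note that `population` is a state-level value repeated on every article row, so summing it over rows adds each state's population once per article. This is why the `population` column in the output runs into the billions, and the per capita figures should be read with that in mind.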

In [41]:
agg_params = {"population": ["sum"], "article_title": ["count"]}
census_divisions = city_df[['regional_division', 'population', 'article_title']].groupby(['regional_division']).agg(agg_params)
# Resetting columns due to hierarchy
census_divisions = census_divisions.reset_index()
# Renaming columns
census_divisions.columns = ['regional_division', 'population', 'num_articles']
# Computing articles per capita
census_divisions['articles_per_capita'] = census_divisions['num_articles'] / census_divisions['population']
# Sorting and displaying
census_divisions.sort_values('articles_per_capita', ascending=False).head()


Unnamed: 0,regional_division,population,num_articles,articles_per_capita
4,New England,3709512562,1437,3.873824e-07
3,Mountain,4063736139,1189,2.925879e-07
7,West North Central,14822991682,3578,2.413818e-07
1,East South Central,7528146572,1529,2.031044e-07
6,South Atlantic,19325152712,1850,9.573016e-08
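
The `population` values in this table are far larger than any real division population because the state population is summed once per article row. A minimal sketch of one way to avoid this, assuming each state carries a single population value, is to deduplicate states before summing. The toy values and the names `toy_df`, `naive_pop`, `division_pop`, `division_articles`, and `division_per_capita` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): states A and B are in "North", state C in "South".
toy_df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "C"],
    "regional_division": ["North", "North", "North", "North", "South"],
    "population": [1000, 1000, 1000, 500, 800],
    "article_title": ["a1", "a2", "a3", "b1", "c1"],
})

# Summing over article rows counts state A three times (3500 for North).
naive_pop = toy_df.groupby("regional_division")["population"].sum()

# Keep one row per state, then sum: each state counted once (1500 for North).
division_pop = (
    toy_df.drop_duplicates("state")
          .groupby("regional_division")["population"].sum()
)

# Article counts still come from the full, row-per-article data.
division_articles = toy_df.groupby("regional_division")["article_title"].count()

division_per_capita = (division_articles / division_pop).sort_values(ascending=False)
print(division_per_capita)
```

Applied to `city_df`, the same pattern would change the denominators, so the ranking could differ from the table shown.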


6. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.


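The final table repeats the aggregation from the previous step, this time on the `high_quality` dataframe, and sorts the result in descending order. Because `population` is again summed over article rows, the `population` column here differs from the one in the previous table.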

In [33]:
agg_params = {"population": ["sum"], "article_title": ["count"]}
census_divisions = high_quality[['regional_division', 'population', 'article_title']].groupby(['regional_division']).agg(agg_params)
# Resetting columns due to hierarchy
census_divisions = census_divisions.reset_index()
# Renaming columns
census_divisions.columns = ['regional_division', 'population', 'num_articles']
# Computing articles per capita
census_divisions['articles_per_capita'] = census_divisions['num_articles'] / census_divisions['population']
# Sorting and displaying
census_divisions.sort_values('articles_per_capita', ascending=False).head()


Unnamed: 0,regional_division,population,num_articles,articles_per_capita
4,New England,622272389,225,3.61578e-07
3,Mountain,1077820209,336,3.117403e-07
7,West North Central,3049018051,638,2.092477e-07
1,East South Central,1751568880,316,1.804097e-07
6,South Atlantic,4952989532,524,1.057947e-07


Looking at the results, the five divisions shown appear in the same order whether we look at high quality articles or at all articles. Since `.head()` returns only the first five rows, the remaining divisions are not compared here.
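
The comparison of the two rankings can also be checked programmatically rather than by eye. The sketch below copies the `articles_per_capita` values from the two division tables above into two Series and compares their sort orders. The names `total`, `high_quality_only`, and `same_order` are introduced here for illustration.

```python
import pandas as pd

# articles_per_capita for all articles, copied from the first division table.
total = pd.Series({
    "New England": 3.873824e-07,
    "Mountain": 2.925879e-07,
    "West North Central": 2.413818e-07,
    "East South Central": 2.031044e-07,
    "South Atlantic": 9.573016e-08,
})

# articles_per_capita for high quality articles, from the second table.
high_quality_only = pd.Series({
    "New England": 3.61578e-07,
    "Mountain": 3.117403e-07,
    "West North Central": 2.092477e-07,
    "East South Central": 1.804097e-07,
    "South Atlantic": 1.057947e-07,
})

# Rank each series in descending order and compare the resulting orderings.
total_order = total.sort_values(ascending=False).index.tolist()
hq_order = high_quality_only.sort_values(ascending=False).index.tolist()
same_order = total_order == hq_order
print(same_order)
```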