# Analysis

This notebook produces the six tables requested in the assignment:
1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).
2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).
3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).
4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).
5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.
6. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.


## Preprocessing

In [1]:
#getting the data to the workspace for analysis
import pandas as pd
df = pd.read_csv("data/wp_scored_city_articles_by_state.csv")

Next, we build the summary table that all later results are derived from:
- One row per state, with its regional_division, population, total and high quality article counts (articles rated FA or GA count as high quality), and the corresponding per capita ratios.

In [2]:
# Filter high-quality articles
high_quality_articles = df[df['article_quality'].isin(['FA', 'GA'])]

# Calculate the total and high-quality articles per state
total_articles_per_state = df.groupby(['regional_division', 'state'])['article_title'].count().reset_index()
high_quality_articles_per_state = high_quality_articles.groupby(['regional_division', 'state'])['article_title'].count().reset_index()

# Combine the two counts, then attach each state's population (one row per state)
result = pd.merge(total_articles_per_state, high_quality_articles_per_state, on=['regional_division', 'state'], suffixes=('_total', '_high_quality'))
state_population = df[['regional_division', 'state', 'population']].drop_duplicates()
result = pd.merge(result, state_population, on=['regional_division', 'state'])

# Calculate the per capita ratios
result['total-articles-per-population'] = result['article_title_total'] / result['population']
result['high-quality-articles-per-population'] = result['article_title_high_quality'] / result['population']

# Rename columns
result.rename(columns={'article_title_total': 'total_articles', 'article_title_high_quality': 'high-quality-articles'}, inplace=True)

# Reorder columns
result = result[['state', 'regional_division', 'population', 'total-articles-per-population', 'high-quality-articles-per-population', 'total_articles', 'high-quality-articles']]

result = result.drop_duplicates().reset_index(drop = True)
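
One caveat about the merge of the two count tables: `pd.merge` performs an inner join by default, so a state with no FA or GA articles would not appear in `result` at all. The sketch below uses a small made-up frame (states `A` and `B` are hypothetical, not taken from the assignment data) to show the difference, along with one possible way to keep such states with a count of zero.

```python
import pandas as pd

# Toy data (hypothetical, not from the assignment CSV)
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "article_title": ["a1", "a2", "b1", "b2"],
    "article_quality": ["FA", "Stub", "Stub", "C"],
})

# Total and high-quality (FA/GA) article counts per state
total = toy.groupby("state")["article_title"].count().reset_index()
high = (
    toy[toy["article_quality"].isin(["FA", "GA"])]
    .groupby("state")["article_title"].count()
    .reset_index()
)

# Default (inner) merge: state B has no FA/GA rows, so it is dropped
inner = pd.merge(total, high, on="state", suffixes=("_total", "_high_quality"))

# Left merge keeps every state; the missing count is filled with 0
left = pd.merge(total, high, on="state", how="left", suffixes=("_total", "_high_quality"))
left["article_title_high_quality"] = left["article_title_high_quality"].fillna(0).astype(int)

print(inner["state"].tolist())
print(left["state"].tolist())
```

Whether this matters here depends on whether every state in the data has at least one FA or GA article; comparing the number of rows in `result` with the number of distinct states in `df` would show it.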

We can print the result table to check whether anything looks odd.

In [3]:
result.sort_values('state').reset_index(drop = True) #added reset_index to make it look cleaner

Unnamed: 0,state,regional_division,population,total-articles-per-population,high-quality-articles-per-population,total_articles,high-quality-articles
0,Alabama,East South Central,5074296.0,9.1e-05,1e-05,461,53
1,Alaska,Pacific,733583.0,0.000203,4.2e-05,149,31
2,Arizona,Mountain,7359197.0,1.2e-05,3e-06,91,24
3,Arkansas,West South Central,3045637.0,0.000164,2.4e-05,500,72
4,California,Pacific,39029342.0,1.2e-05,4e-06,482,173
5,Colorado,Mountain,5839926.0,4.9e-05,1.3e-05,288,76
6,Delaware,South Atlantic,1018396.0,5.6e-05,2.5e-05,57,25
7,Florida,South Atlantic,22244823.0,1.8e-05,5e-06,411,118
8,Georgia,South Atlantic,10912876.0,4.9e-05,9e-06,538,93
9,Hawaii,Pacific,1440196.0,0.000105,2.1e-05,151,30
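
Alongside the visual check, a few automated checks can catch problems that are easy to miss by eye. The following is a minimal sketch, assuming a table with the same column names as `result`; the helper name `sanity_check` is hypothetical, and the two rows are copied from the output above only so the example runs on its own.

```python
import pandas as pd

def sanity_check(table: pd.DataFrame) -> dict:
    """Return simple diagnostics for a per-state summary table (hypothetical helper)."""
    recomputed = table["total_articles"] / table["population"]
    return {
        # Each state should appear exactly once
        "duplicate_states": int(table["state"].duplicated().sum()),
        # No cell should be missing
        "missing_values": int(table.isna().sum().sum()),
        # High-quality articles are a subset of all articles
        "high_quality_exceeds_total": int(
            (table["high-quality-articles"] > table["total_articles"]).sum()
        ),
        # The stored ratio should match counts divided by population
        "ratio_mismatch": int(
            ((recomputed - table["total-articles-per-population"]).abs() > 1e-12).sum()
        ),
    }

# Two rows taken from the table above
table = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "population": [5074296.0, 733583.0],
    "total_articles": [461, 149],
    "high-quality-articles": [53, 31],
})
table["total-articles-per-population"] = table["total_articles"] / table["population"]

report = sanity_check(table)
print(report)
```

Every value in the returned dictionary should be zero for a clean table.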


## Results

The `result` table contains everything we need for this analysis, so each question below is answered by sorting and filtering it.
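
The first four questions all follow the same pattern: sort by one per capita column, keep ten rows, and show only the relevant columns. As a sketch of that pattern, the hypothetical helper `rank_states` below wraps it in one function. The three rows are copied from the tables in this notebook so that the example runs without the CSV file.

```python
import pandas as pd

def rank_states(table, column, n=10, ascending=False,
                keep=("state", "regional_division", "population")):
    """Sort `table` by `column` and return the first `n` rows (hypothetical helper)."""
    columns = list(keep) + [column]
    return (
        table.sort_values(by=column, ascending=ascending)
        .head(n)[columns]
        .reset_index(drop=True)
    )

# Three states taken from the result tables in this notebook
sample = pd.DataFrame({
    "state": ["Nevada", "Vermont", "Iowa"],
    "regional_division": ["Mountain", "New England", "West North Central"],
    "population": [3177772.0, 647064.0, 3200517.0],
    "total_articles": [19, 329, 1042],
})
sample["total-articles-per-population"] = sample["total_articles"] / sample["population"]

top = rank_states(sample, "total-articles-per-population", n=2)
bottom = rank_states(sample, "total-articles-per-population", n=1, ascending=True)
print(top)
print(bottom)
```

With the full `result` table, the same call with `n=10` and the appropriate column would give each of the four state rankings.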

1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).

In [17]:
# Sort by total articles per capita, keep the top 10, and drop the high-quality columns
df1 = (
    result.sort_values(by='total-articles-per-population', ascending=False)
    .head(10)
    .drop(columns=['high-quality-articles', 'high-quality-articles-per-population'])
    .reset_index(drop=True)
)
df1

Unnamed: 0,state,regional_division,population,total-articles-per-population,total_articles
0,Vermont,New England,647064.0,0.000508,329
1,North Dakota,West North Central,779261.0,0.000457,356
2,Maine,New England,1385340.0,0.000349,483
3,South Dakota,West North Central,909824.0,0.000342,311
4,Iowa,West North Central,3200517.0,0.000326,1042
5,Alaska,Pacific,733583.0,0.000203,149
6,Pennsylvania,Middle Atlantic,12972008.0,0.000197,2556
7,Michigan,East North Central,10034113.0,0.000177,1773
8,Wyoming,Mountain,581381.0,0.00017,99
9,New Hampshire,New England,1395231.0,0.000168,234


2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).

In [16]:
# Sort ascending by total articles per capita, keep the bottom 10, and drop the high-quality columns
df2 = (
    result.sort_values(by='total-articles-per-population')
    .head(10)
    .drop(columns=['high-quality-articles', 'high-quality-articles-per-population'])
    .reset_index(drop=True)
)
df2

Unnamed: 0,state,regional_division,population,total-articles-per-population,total_articles
0,North Carolina,South Atlantic,10698973.0,5e-06,50
1,Nevada,Mountain,3177772.0,6e-06,19
2,California,Pacific,39029342.0,1.2e-05,482
3,Arizona,Mountain,7359197.0,1.2e-05,91
4,Virginia,South Atlantic,8683619.0,1.5e-05,133
5,Florida,South Atlantic,22244823.0,1.8e-05,411
6,Oklahoma,West South Central,4019800.0,1.9e-05,75
7,Kansas,West North Central,2937150.0,2.1e-05,63
8,Maryland,South Atlantic,6164660.0,2.5e-05,157
9,Wisconsin,East North Central,5892539.0,3.2e-05,191


3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).

In [18]:
# Sort by high-quality articles per capita, keep the top 10, and drop the total-article columns
df3 = (
    result.sort_values(by='high-quality-articles-per-population', ascending=False)
    .head(10)
    .drop(columns=['total_articles', 'total-articles-per-population'])
    .reset_index(drop=True)
)
df3

Unnamed: 0,state,regional_division,population,high-quality-articles-per-population,high-quality-articles
0,Vermont,New England,647064.0,7e-05,45
1,Wyoming,Mountain,581381.0,6.7e-05,39
2,South Dakota,West North Central,909824.0,6.2e-05,56
3,West Virginia,South Atlantic,1775156.0,6e-05,106
4,Montana,Mountain,1122867.0,4.9e-05,55
5,New Hampshire,New England,1395231.0,4.5e-05,63
6,Pennsylvania,Middle Atlantic,12972008.0,4.4e-05,566
7,Missouri,West North Central,6177957.0,4.3e-05,263
8,Alaska,Pacific,733583.0,4.2e-05,31
9,New Jersey,Middle Atlantic,9261699.0,4.1e-05,379


4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).

In [19]:
# Sort ascending by high-quality articles per capita, keep the bottom 10, and drop the total-article columns
df4 = (
    result.sort_values(by='high-quality-articles-per-population')
    .head(10)
    .drop(columns=['total_articles', 'total-articles-per-population'])
    .reset_index(drop=True)
)
df4

Unnamed: 0,state,regional_division,population,high-quality-articles-per-population,high-quality-articles
0,North Carolina,South Atlantic,10698973.0,2e-06,20
1,Virginia,South Atlantic,8683619.0,2e-06,18
2,Nevada,Mountain,3177772.0,3e-06,8
3,Arizona,Mountain,7359197.0,3e-06,24
4,California,Pacific,39029342.0,4e-06,173
5,Florida,South Atlantic,22244823.0,5e-06,118
6,New York,Middle Atlantic,19677151.0,6e-06,111
7,Maryland,South Atlantic,6164660.0,7e-06,42
8,Kansas,West North Central,2937150.0,7e-06,22
9,Oklahoma,West South Central,4019800.0,8e-06,31


5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.

In [13]:
df5 = result.groupby('regional_division').agg({
    'population': 'sum',
    'total_articles': 'sum'
}).reset_index()

# Rename the columns for clarity
df5 = df5.rename(columns={'population': 'sum_population', 'total_articles': 'sum_total_articles'})

df5['total_articles_per_capita'] = df5['sum_total_articles']/df5['sum_population']

df5.sort_values("total_articles_per_capita",ascending= False).reset_index(drop = True)

Unnamed: 0,regional_division,sum_population,sum_total_articles,total_articles_per_capita
0,West North Central,19721893.0,3577,0.000181
1,New England,11503343.0,1437,0.000125
2,East North Central,47097779.0,4753,0.000101
3,Middle Atlantic,41910858.0,3780,9e-05
4,East South Central,19578002.0,1528,7.8e-05
5,West South Central,41685250.0,2100,5e-05
6,Mountain,25514320.0,1187,4.7e-05
7,South Atlantic,66781137.0,1849,2.8e-05
8,Pacific,53229044.0,1304,2.4e-05
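
Note that the division figures divide the summed article count by the summed population, which weights each state by its population. Averaging the per capita ratios of the states would give a different number, because small states would count as much as large ones. The sketch below illustrates the difference with made-up numbers (division `D1` and states `X` and `Y` are hypothetical).

```python
import pandas as pd

# Hypothetical division with one small and one large state
toy = pd.DataFrame({
    "regional_division": ["D1", "D1"],
    "state": ["X", "Y"],
    "population": [1_000_000, 9_000_000],
    "total_articles": [100, 90],
})
toy["per_capita"] = toy["total_articles"] / toy["population"]

# Pooled rate: sum the articles and the population first, then divide
pooled = toy.groupby("regional_division")[["population", "total_articles"]].sum()
pooled_rate = pooled["total_articles"] / pooled["population"]

# Unweighted alternative: average the per-state ratios
mean_of_rates = toy.groupby("regional_division")["per_capita"].mean()

print(pooled_rate["D1"])     # 190 / 10,000,000
print(mean_of_rates["D1"])   # (100/1,000,000 + 90/9,000,000) / 2
```

In this toy case the pooled rate is 1.9e-05 while the mean of the state ratios is 5.5e-05, so the choice of method can change a ranking.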


6. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.

In [11]:
df6 = result.groupby('regional_division').agg({
    'population': 'sum',
    'high-quality-articles': 'sum'
}).reset_index()

# Rename the columns for clarity
df6 = df6.rename(columns={'population': 'sum_population', 'high-quality-articles': 'sum_high-quality-articles'})

df6['high_quality_articles_per_capita'] = df6['sum_high-quality-articles']/df6['sum_population']

df6.sort_values("high_quality_articles_per_capita",ascending= False).reset_index(drop = True)

Unnamed: 0,regional_division,sum_population,sum_high-quality-articles,high_quality_articles_per_capita
0,West North Central,19721893.0,639,3.2e-05
1,Middle Atlantic,41910858.0,1056,2.5e-05
2,New England,11503343.0,225,2e-05
3,East South Central,19578002.0,316,1.6e-05
4,West South Central,41685250.0,634,1.5e-05
5,East North Central,47097779.0,716,1.5e-05
6,Mountain,25514320.0,335,1.3e-05
7,Pacific,53229044.0,490,9e-06
8,South Atlantic,66781137.0,525,8e-06
