# Considering Bias in Data

This analysis examines how bias can manifest in datasets, particularly in publicly contributed information sources such as Wikipedia.

The analysis will cover the following aspects:
1. Coverage Disparity of US Cities on Wikipedia: An assessment of the differences in coverage among various US states in relation to their population.
2. Quality Disparity of Wikipedia Articles: Examination of the variation in the proportion of high-quality articles about cities across different states.
3. Ranking of US Geographic Regions: A ranking system based on articles-per-person ratio and the proportion of high-quality articles to offer insights into the bias and disparities present in Wikipedia data.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('wp_scored_city_articles_by_state.csv')
df.head()

Unnamed: 0,state,population,regional_division,article_title,revision_id,article_quality
0,Alabama,5074296,East South Central,"Abbeville, Alabama",1171163550,C
1,Alabama,5074296,East South Central,"Adamsville, Alabama",1177621427,C
2,Alabama,5074296,East South Central,"Addison, Alabama",1168359898,C
3,Alabama,5074296,East South Central,"Akron, Alabama",1165909508,GA
4,Alabama,5074296,East South Central,"Alabaster, Alabama",1179139816,C


## Top 10 US states by coverage
The 10 US states with the highest total articles per capita (in descending order)

In [3]:
# Group by state and count the number of articles
articles_per_state = df.groupby('state')['article_title'].count().reset_index()

# Getting unique values of state along with corresponding population
population_per_state= df[['state', 'population']].drop_duplicates()

# Merge data
articles_per_capita = pd.merge(articles_per_state,population_per_state, on='state')

# Calculate the article per capita
articles_per_capita['article_per_capita'] = articles_per_capita['article_title'] / articles_per_capita['population']

articles_per_capita.head()


Unnamed: 0,state,article_title,population,article_per_capita
0,Alabama,461,5074296,9.1e-05
1,Alaska,149,733583,0.000203
2,Arizona,91,7359197,1.2e-05
3,Arkansas,500,3045637,0.000164
4,California,482,39029342,1.2e-05
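
As a quick sanity check on the ratio, the per-capita value can be recomputed by hand for a couple of rows. The sketch below builds a tiny frame from the Alabama and Alaska rows shown above; the name `check` and the `articles_per_100k` column are introduced only for illustration and are not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the output above (article counts and populations).
check = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    'article_title': [461, 149],
    'population': [5074296, 733583],
})

# Same ratio as in the cell above: articles divided by population.
check['article_per_capita'] = check['article_title'] / check['population']

# Illustrative extra column: the same ratio per 100,000 residents,
# which is often easier to read than scientific notation.
check['articles_per_100k'] = check['article_per_capita'] * 100_000
print(check)
```

Scaling to a per-100,000 rate does not change any ranking; it only makes the values easier to compare by eye.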


In [9]:
top_10=articles_per_capita.sort_values(by='article_per_capita', ascending=False)[:10].reset_index(drop=True)
top_10

Unnamed: 0,state,article_title,population,article_per_capita
0,Vermont,329,647064,0.000508
1,North Dakota,356,779261,0.000457
2,Maine,483,1385340,0.000349
3,South Dakota,311,909824,0.000342
4,Iowa,1043,3200517,0.000326
5,Alaska,149,733583,0.000203
6,Pennsylvania,2556,12972008,0.000197
7,Michigan,1773,10034113,0.000177
8,Wyoming,99,581381,0.00017
9,New Hampshire,234,1395231,0.000168


## Bottom 10 US states by coverage
The 10 US states with the lowest total articles per capita (in ascending order)

In [10]:
bottom_10=articles_per_capita.sort_values(by='article_per_capita', ascending=True)[:10].reset_index(drop=True)
bottom_10

Unnamed: 0,state,article_title,population,article_per_capita
0,North Carolina,50,10698973,5e-06
1,Nevada,19,3177772,6e-06
2,California,482,39029342,1.2e-05
3,Arizona,91,7359197,1.2e-05
4,Virginia,133,8683619,1.5e-05
5,Florida,412,22244823,1.9e-05
6,Oklahoma,75,4019800,1.9e-05
7,Kansas,63,2937150,2.1e-05
8,Maryland,157,6164660,2.5e-05
9,Wisconsin,192,5892539,3.3e-05
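
The sort-then-slice pattern used for the top and bottom lists can also be written with pandas' `nlargest` and `nsmallest`. The following is a minimal sketch on four rows copied from the tables above; the frame `sample` is a hypothetical stand-in for `articles_per_capita`.

```python
import pandas as pd

# Hypothetical miniature of `articles_per_capita`, using rows shown above.
sample = pd.DataFrame({
    'state': ['Vermont', 'North Dakota', 'Nevada', 'North Carolina'],
    'article_title': [329, 356, 19, 50],
    'population': [647064, 779261, 3177772, 10698973],
})
sample['article_per_capita'] = sample['article_title'] / sample['population']

# nlargest / nsmallest return rows already ordered by the chosen column.
top_2 = sample.nlargest(2, 'article_per_capita').reset_index(drop=True)
bottom_2 = sample.nsmallest(2, 'article_per_capita').reset_index(drop=True)
print(top_2)
print(bottom_2)
```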


## Top 10 US states by high quality
The 10 US states with the highest high quality articles per capita (in descending order). 

For this analysis, "high quality" articles are those that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class.



In [12]:
# Group by state and count the number of high-quality articles
high_quality_articles = df[df['article_quality'].isin(['GA', 'FA'])].groupby('state')['article_quality'].count().reset_index()

# Merge with population data
hq_articles_per_capita = pd.merge(high_quality_articles,population_per_state, on='state')

# Calculate the high quality article per capita
hq_articles_per_capita['article_quality'] = hq_articles_per_capita['article_quality'] / hq_articles_per_capita['population']

hq_articles_per_capita.head()

Unnamed: 0,state,article_quality,population
0,Alabama,1e-05,5074296
1,Alaska,4.2e-05,733583
2,Arizona,3e-06,7359197
3,Arkansas,2.4e-05,3045637
4,California,4e-06,39029342
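
One thing to watch for in the approach above: filtering to GA/FA rows before grouping drops any state that has no high-quality articles at all, so such a state could never appear in a bottom-10 list. Whether that happens in this dataset is not shown here. A minimal sketch of one alternative, on made-up rows (the states `A` and `B` and the column `is_hq` are hypothetical), flags each article first and then sums the flags so that zero counts are kept:

```python
import pandas as pd

# Made-up rows for illustration only; state 'B' has no GA/FA articles.
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'B'],
    'population': [1000, 1000, 1000, 500, 500],
    'article_quality': ['GA', 'C', 'FA', 'Stub', 'C'],
})

# Flag high-quality rows, then sum the boolean flag per state.
toy['is_hq'] = toy['article_quality'].isin(['GA', 'FA'])
hq_counts = (
    toy.groupby('state')
       .agg(hq_articles=('is_hq', 'sum'), population=('population', 'first'))
       .reset_index()
)

# States without high-quality articles stay in the result with a rate of 0.
hq_counts['hq_per_capita'] = hq_counts['hq_articles'] / hq_counts['population']
print(hq_counts)
```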


In [15]:
top_10_hq=hq_articles_per_capita.sort_values(by='article_quality', ascending=False)[:10].reset_index(drop=True)
top_10_hq

Unnamed: 0,state,article_quality,population
0,Vermont,7e-05,647064
1,Wyoming,6.7e-05,581381
2,South Dakota,6.2e-05,909824
3,West Virginia,5.9e-05,1775156
4,Montana,4.9e-05,1122867
5,New Hampshire,4.5e-05,1395231
6,Pennsylvania,4.4e-05,12972008
7,Missouri,4.3e-05,6177957
8,Alaska,4.2e-05,733583
9,New Jersey,4.1e-05,9261699


## Bottom 10 US states by high quality
The 10 US states with the lowest high quality articles per capita (in ascending order)


In [16]:
bottom_10_hq=hq_articles_per_capita.sort_values(by='article_quality', ascending=True)[:10].reset_index(drop=True)
bottom_10_hq

Unnamed: 0,state,article_quality,population
0,North Carolina,2e-06,10698973
1,Virginia,2e-06,8683619
2,Nevada,3e-06,3177772
3,Arizona,3e-06,7359197
4,California,4e-06,39029342
5,Florida,5e-06,22244823
6,New York,6e-06,19677151
7,Maryland,7e-06,6164660
8,Kansas,7e-06,2937150
9,Oklahoma,8e-06,4019800


## Census divisions by total coverage
A rank-ordered list of US census divisions (in descending order) by total articles per capita

In [4]:
# Group by region and count the number of articles
articles_per_region = df.groupby('regional_division')['article_title'].count().reset_index()

# Group by region and sum the population column (the sum runs over every article row)
population_per_region= df.groupby('regional_division')['population'].sum().reset_index()

# Merge data
articles_per_region = pd.merge(articles_per_region,population_per_region, on='regional_division')

# Calculate the article per capita
articles_per_region['article_per_capita'] = articles_per_region['article_title'] / articles_per_region['population']

# Convert scientific notation to decimal
articles_per_region['article_per_capita'] = articles_per_region['article_per_capita'].apply(lambda x: '%.9f' % x)

articles_per_region

Unnamed: 0,regional_division,article_title,population,article_per_capita
0,East North Central,4754,50000102986,9.5e-08
1,East South Central,1529,7567764699,2.02e-07
2,Middle Atlantic,3781,51386647495,7.4e-08
3,Mountain,1189,4100790927,2.9e-07
4,New England,1437,3708797804,3.87e-07
5,Pacific,1304,22348595190,5.8e-08
6,South Atlantic,1850,19595194539,9.4e-08
7,West North Central,3578,14841264104,2.41e-07
8,West South Central,2103,39975932892,5.3e-08
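
The population column in the table above runs into the billions, far above any real division's population. This happens because the sum is taken over every article row, so each state's population is counted once per article. A sketch of one way around this, on made-up rows (states `A`, `B`, `C` and divisions `X`, `Y` are hypothetical), reduces the data to one row per state before summing:

```python
import pandas as pd

# Made-up rows: one row per article, three states in two divisions.
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'C'],
    'regional_division': ['X', 'X', 'X', 'X', 'Y'],
    'population': [1000, 1000, 1000, 500, 2000],
    'article_title': ['a1', 'a2', 'a3', 'b1', 'c1'],
})

# Summing over article rows counts state A's population three times.
naive = toy.groupby('regional_division')['population'].sum()

# Keep one row per state first, then sum within each division.
state_pop = toy[['state', 'regional_division', 'population']].drop_duplicates()
population_by_division = (
    state_pop.groupby('regional_division')['population'].sum().reset_index()
)

# Count articles per division and compute the per-capita ratio.
counts = toy.groupby('regional_division')['article_title'].count().reset_index()
result = pd.merge(counts, population_by_division, on='regional_division')
result['article_per_capita'] = result['article_title'] / result['population']
print(naive)
print(result)
```

In this toy case the naive sum for division `X` is 3500, while the deduplicated total is 1500. Because the inflation depends on how many articles each state has, it can change the relative ordering of divisions, not just the scale of the values.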


## Census divisions by high quality coverage
A rank-ordered list of US census divisions (in descending order) by high quality articles per capita

In [5]:
# Group by region and count the number of high-quality articles
hq_articles_region = df[df['article_quality'].isin(['GA', 'FA'])].groupby('regional_division')['article_title'].count().reset_index()

# Merge with population data
hq_articles_region = pd.merge(hq_articles_region,population_per_region, on='regional_division')

# Calculate the high quality article per capita
hq_articles_region['hq_article_per_capita'] = hq_articles_region['article_title'] / hq_articles_region['population']

# Convert scientific notation to decimal
hq_articles_region['hq_article_per_capita'] = hq_articles_region['hq_article_per_capita'].apply(lambda x: '%.9f' % x)


hq_articles_region

Unnamed: 0,regional_division,article_title,population,hq_article_per_capita
0,East North Central,711,50000102986,1.4e-08
1,East South Central,315,7567764699,4.2e-08
2,Middle Atlantic,1056,51386647495,2.1e-08
3,Mountain,336,4100790927,8.2e-08
4,New England,225,3708797804,6.1e-08
5,Pacific,491,22348595190,2.2e-08
6,South Atlantic,524,19595194539,2.7e-08
7,West North Central,638,14841264104,4.3e-08
8,West South Central,638,39975932892,1.6e-08
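
The heading asks for a ranked list, while the table above is in alphabetical order by division. A sketch of the ranking step, using three values copied from the table above as reported (the frame `sample` and the `display` column are illustrative), sorts on the numeric column first and formats only for display:

```python
import pandas as pd

# Three rows copied from the table above, values as reported there.
sample = pd.DataFrame({
    'regional_division': ['East North Central', 'Mountain', 'New England'],
    'hq_article_per_capita': [1.4e-08, 8.2e-08, 6.1e-08],
})

# Sort while the column is still numeric.
ranked = (
    sample.sort_values('hq_article_per_capita', ascending=False)
          .reset_index(drop=True)
)

# Format into a separate display column so the numeric values stay usable.
ranked['display'] = ranked['hq_article_per_capita'].map(lambda x: '%.9f' % x)
print(ranked)
```

Keeping the formatted strings in their own column avoids replacing the numeric column with text, which would rule out further arithmetic on it.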
