## Purpose

The purpose of this code is to process data on article quality predictions from ORES. 

The article source data comes from English Wikipedia, the text of which is licensed under the "Creative Commons Attribution Share-Alike license" (https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

We will be using the MediaWiki REST API for English Wikipedia. To get more information on the API please use the following link: https://www.mediawiki.org/wiki/API:Main_page. The following link may also be helpful when looking for more documentation: https://www.mediawiki.org/wiki/API:Info

We will leverage code developed by Dr. David W. McDonald for use in Data 512, which is provided under the Creative Commons CC-BY license (https://creativecommons.org/ and https://creativecommons.org/licenses/by/4.0/). The file can be found at this link: https://drive.google.com/drive/folders/1FtvWV31DHE8HIMdEsPGuCXPz0PMvShfl

We will also be using the ORES API. Information on the API itself can be found at https://www.mediawiki.org/wiki/ORES, with original API documentation from https://ores.wikimedia.org/docs and new LiftWing documentation from https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage.

We will begin by importing the Python libraries we need.

In [96]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

To conduct our final analysis we will need to combine data with state per article, rating and revision id per article, population by state, and regional division by state. We start by reading in the rating per article and turning it into a dataframe.

In [97]:
#Loading rating per article into a dictionary; the with block
#closes the file once it has been read
with open('../raw_data/final_ores_scores.json') as ores_scores_file:
    ores_scores_dict = json.load(ores_scores_file)

#Getting list of keys and values
ores_keys = list(ores_scores_dict.keys())
ores_values = list(ores_scores_dict.values())

#Combining into a pandas df
ores_scores_df = pd.DataFrame({'article_title':ores_keys,
                               'article_quality':ores_values})
ores_scores_df.head()

Unnamed: 0,article_title,article_quality
0,"Abbeville, Alabama",C
1,"Adamsville, Alabama",C
2,"Addison, Alabama",C
3,"Akron, Alabama",GA
4,"Alabaster, Alabama",C
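
As a minimal sketch of the same dictionary-to-dataframe step, the key/value pairs can be passed to pandas directly through `dict.items()`. The helper name `dict_to_df` and the two-entry sample dictionary are hypothetical stand-ins for the contents of the JSON file, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the contents of final_ores_scores.json
sample_scores = {
    "Abbeville, Alabama": "C",
    "Akron, Alabama": "GA",
}

def dict_to_df(mapping, key_col, value_col):
    """Turn a {key: value} dict into a two-column dataframe (illustrative helper)."""
    return pd.DataFrame(list(mapping.items()), columns=[key_col, value_col])

sample_df = dict_to_df(sample_scores, "article_title", "article_quality")
print(sample_df)
```

The same helper would apply to the revision id dictionary by changing the column names.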


Next we will read in the revision id per article and turn it into a dataframe. This dataframe will later be merged, on article title, with the rating-per-article dataframe created above.

In [98]:
#Loading revision id per article into a dictionary; the with block
#closes the file once it has been read
with open('../raw_data/page_info.json') as page_info_file:
    page_info_dict = json.load(page_info_file)

#Getting list of keys and values
page_info_keys = list(page_info_dict.keys())
page_info_values = list(page_info_dict.values())

#Combining into a pandas df
page_info_df = pd.DataFrame({'article_title':page_info_keys,
                             'revision_id':page_info_values})
page_info_df.head()

Unnamed: 0,article_title,revision_id
0,"Abbeville, Alabama",1171163550
1,"Adamsville, Alabama",1177621427
2,"Addison, Alabama",1168359898
3,"Akron, Alabama",1165909508
4,"Alabaster, Alabama",1179139816


Next we will read in the list of article titles by states and turn it into a dataframe. Some states have non-standard names or formats (e.g., Georgia appears as 'Georgia (U.S. state)' and two-name states have "_") which we will update here for later joining.

In [99]:
#Read in articles per state using pandas
articles_by_st = pd.read_csv('../raw_data/clean_us_cities_by_state.csv')
articles_by_st.head()

#Keep only the 2 columns which are relevant for later analysis; .copy()
#avoids pandas' SettingWithCopyWarning when we modify the result below
articles_by_st_df = articles_by_st[['state','page_title']].copy()

#Renaming the columns to get the final names
articles_by_st_df.columns = ['state','article_title']

#Fixing state names to align with government state names
articles_by_st_df['state'] = articles_by_st_df['state'].str.replace(
    '_', ' ', regex = False)
articles_by_st_df = articles_by_st_df.replace('Georgia (U.S. state)',
                                              'Georgia')
articles_by_st_df.head()



Unnamed: 0,state,article_title
0,Alabama,"Abbeville, Alabama"
1,Alabama,"Adamsville, Alabama"
2,Alabama,"Addison, Alabama"
3,Alabama,"Akron, Alabama"
4,Alabama,"Alabaster, Alabama"
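
The state-name cleanup described above can be sketched as a small function so that it is easy to check on a few values. The function name `normalize_state` and the raw sample values (including the underscore form of the Georgia entry) are assumptions for illustration only.

```python
import pandas as pd

def normalize_state(names: pd.Series) -> pd.Series:
    """Replace underscores with spaces, then map known non-standard names."""
    cleaned = names.str.replace("_", " ", regex=False)
    # Exact-match replacement of the one non-standard name noted in the text
    return cleaned.replace({"Georgia (U.S. state)": "Georgia"})

# Hypothetical raw values in the style described above
raw_states = pd.Series(["New_York", "Georgia_(U.S._state)", "Alabama"])
clean_states = normalize_state(raw_states)
print(clean_states.tolist())
```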


Now, we will read in the population per state. We will remove rows which are not in the 50 states and columns which do not relate to states and their 2022 population estimates.

In [100]:
#Read in info using pandas
pop_by_st = pd.read_csv('../raw_data/NST-EST2022-POP.csv')
pop_by_st.head()

#Update the col names
pop_by_st.columns = ['geographic_area','April 1, 2020 Estimates Base',
                        '2020_pop_est','2021_pop_est','2022_pop_est']

#Remove unused columns & renaming remaining
pop_by_st_df = pop_by_st[['geographic_area','2022_pop_est']]
pop_by_st_df.columns = ['state','population']

#Removing rows which aren't states
not_states = ['Annual Estimates of the Resident Population for the United States, Regions, States, District of Columbia, and Puerto Rico: April 1, 2020 to July 1, 2022',
       'Geographic Area', 'United States', 'Northeast', 'Midwest',
       'South', 'West', '.Puerto Rico','.District of Columbia',
       'Note: The estimates are developed from a base that incorporates the 2020 Census, Vintage 2020 estimates, and (for the U.S. only) 2020 Demographic Analysis estimates.  For population estimates methodology statements, see https://www.census.gov/programs-surveys/popest/technical-documentation/methodology.html. See Geographic Terms and Definitions at https://www.census.gov/programs-surveys/popest/guidance-geographies/terms-and-definitions.html for a list of the states that are included in each region. All geographic boundaries for the 2022 population estimates series are as of January 1, 2022. ',
       'Suggested Citation:',
       'Annual Estimates of the Resident Population for the United States, Regions, States, District of Columbia, and Puerto Rico: April 1, 2020 to July 1, 2022 (NST-EST2022-POP)',
       'Source: U.S. Census Bureau, Population Division',
       'Release Date: December 2022']
pop_by_st_df = pop_by_st_df.loc[
                ~pop_by_st_df['state'].isin(not_states)]
pop_by_st_df = pop_by_st_df.loc[
                ~pop_by_st_df['state'].isnull()]

#Verifying only 50 states in final result
if len(pop_by_st_df['state'].unique()) == 50:
    print("Correct number of states remain (50).")
else:
    print("ERROR - {0} states, expecting 50.".format(
        len(pop_by_st_df['state'].unique())))
    
#Removing leading "." in state names which came in from the file;
#regex = False treats "." as a literal character rather than a wildcard
pop_by_st_df = pop_by_st_df.assign(
    state = pop_by_st_df['state'].str.replace('.', '', regex = False))

#Viewing head of file
pop_by_st_df.head()

Correct number of states remain (50).


Unnamed: 0,state,population
8,Alabama,5074296
9,Alaska,733583
10,Arizona,7359197
11,Arkansas,3045637
12,California,39029342
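
Because the state-level rows in this file carry a leading "." (as the cleanup step above shows), an alternative to listing every unwanted label is to keep only the dotted rows and then drop the two non-state areas. This is a sketch under that assumption about the file layout; the sample rows and population values below are illustrative placeholders, not Census figures.

```python
import pandas as pd

# Illustrative placeholder rows mimicking the layout of the Census file
sample_pop = pd.DataFrame({
    "state": ["United States", "Northeast", ".Alabama", ".Alaska",
              ".District of Columbia", ".Puerto Rico", None],
    "population": [600, 100, 50, 10, 5, 30, None],
})

# Areas that are dotted in the file but are not among the 50 states
non_states = {"District of Columbia", "Puerto Rico"}

# Keep rows whose label starts with "."; missing labels count as False
is_dotted = sample_pop["state"].str.startswith(".", na=False)
states_only = sample_pop.loc[is_dotted].copy()

# Strip the leading "." and drop the non-state areas
states_only["state"] = states_only["state"].str.lstrip(".")
states_only = states_only[~states_only["state"].isin(non_states)]
print(states_only)
```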


Finally, we will read in the US states and their regional divisions. We will remove columns containing Census Divisions as these will not be included in our final analysis.

In [101]:
#Read in info using pandas
st_by_region = pd.read_csv('../raw_data/states_by_region.csv')
st_by_region.head()

#Remove non-necessary columns & change names
st_by_region_df = st_by_region[['REGION','STATE']]
st_by_region_df.columns = ['regional_division','state']
st_by_region_df.head()

Unnamed: 0,regional_division,state
0,Northeast,Connecticut
1,Northeast,Maine
2,Northeast,Massachusetts
3,Northeast,New Hampshire
4,Northeast,Rhode Island


Now we will begin to merge our dataframes together. We will use the st_by_region as our "left" dataframe and add the population information as the "right" dataframe.

In [102]:
#Merging dataframes
st_reg_pop_df = pd.merge(st_by_region_df, pop_by_st_df,
                         how = 'left', on = 'state')

#Validating that we still have 50 states
if len(st_reg_pop_df) == 50:
    print("Merge went correctly, 50 rows")
else:
    print("ERROR - merge did not work {0} rows".format(
    len(st_reg_pop_df)))
    
#Looking at df
st_reg_pop_df.head()

Merge went correctly, 50 rows


Unnamed: 0,regional_division,state,population
0,Northeast,Connecticut,3626205
1,Northeast,Maine,1385340
2,Northeast,Massachusetts,6981974
3,Northeast,New Hampshire,1395231
4,Northeast,Rhode Island,1093734
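
As a complement to the row-count check, `pd.merge` accepts a `validate` argument that raises an error when the join keys are not unique. The sketch below uses two of the rows shown above; the duplicated frame `dup_populations` is a hypothetical example added only to show the failure case.

```python
import pandas as pd

regions = pd.DataFrame({
    "regional_division": ["Northeast", "Northeast"],
    "state": ["Connecticut", "Maine"],
})
populations = pd.DataFrame({
    "state": ["Connecticut", "Maine"],
    "population": [3626205, 1385340],
})

# validate="one_to_one" asserts each state appears once on both sides
merged = pd.merge(regions, populations, how="left", on="state",
                  validate="one_to_one")

# Hypothetical failure case: a state repeated in the right-hand frame
dup_populations = pd.concat([populations, populations.iloc[[0]]],
                            ignore_index=True)
try:
    pd.merge(regions, dup_populations, how="left", on="state",
             validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```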


Next we will merge in the article title information from articles_by_st_df using "state" as the matching key.

In [103]:
#Merging dataframes
st_reg_pop_title_df = pd.merge(st_reg_pop_df, articles_by_st_df,
                               how = 'left', on = 'state')

#Checking right # of rows
if len(st_reg_pop_title_df) > len(articles_by_st_df):
    print("WARNING - Some states are missing article title information:")
    print(st_reg_pop_title_df[st_reg_pop_title_df.isnull().any(axis = 1)])
elif len(st_reg_pop_title_df) < len(articles_by_st_df):
    print("WARNING - there are more states with articles than there are")
    print("states with population in your df")
    print(st_reg_pop_title_df[st_reg_pop_title_df.isnull().any(axis = 1)])
else:
    print("Successful merge, expected {0} rows and got {1}".format(
    len(articles_by_st_df), len(st_reg_pop_title_df)))

      regional_division        state population article_title
0             Northeast  Connecticut  3,626,205           NaN
12881           Midwest     Nebraska  1,967,923           NaN


From the above printout we see that there are states which do not have any articles written about their cities, per our data. Given we can look up city information from Connecticut and Nebraska on Wikipedia, we believe this is an issue with our scraping, rather than a lack of articles. We will proceed without the CT and NE data, though it will be noted in our analysis.
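
The states without articles can also be listed directly by passing `indicator=True` to the merge, which adds a `_merge` column marking rows found only in the left dataframe. This is a minimal sketch on a three-state sample; the sample frames are illustrative and much smaller than the real data.

```python
import pandas as pd

sample_states = pd.DataFrame({"state": ["Alabama", "Connecticut", "Nebraska"]})
sample_articles = pd.DataFrame({
    "state": ["Alabama", "Alabama"],
    "article_title": ["Abbeville, Alabama", "Adamsville, Alabama"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
flagged = pd.merge(sample_states, sample_articles, how="left",
                   on="state", indicator=True)

# States that found no matching article rows
states_without_articles = flagged.loc[
    flagged["_merge"] == "left_only", "state"].tolist()
print(states_without_articles)
```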

In [104]:
#Removing the Connecticut and Nebraska blank rows from our df
st_reg_pop_title_48_df = st_reg_pop_title_df[~st_reg_pop_title_df.isnull().any(axis = 1)]

Now we will merge the revision ids and article quality measurements with our state, region, population, title df. We switch from a left merge to a right merge for our quality rankings because there are likely fewer quality rankings than article titles, as ORES was unable to pull rankings for all articles.

In [105]:
#Merge in the revision ids
rev_df = pd.merge(st_reg_pop_title_48_df, page_info_df, how = "left",
                 on = "article_title")

#Checking right # of rows
if len(rev_df) > len(page_info_df):
    print("WARNING - Some articles are missing revision ids:")
    print(rev_df[rev_df.isnull().any(axis = 1)])
elif len(rev_df) < len(page_info_df):
    print("WARNING - there are more revision ids than article titles")
    print(rev_df[rev_df.isnull().any(axis = 1)])
else:
    print("Successful merge, expected {0} rows and got {1}".format(
    len(page_info_df), len(rev_df)))

#Merge in the article quality measures
final_merge_df = pd.merge(rev_df, ores_scores_df, how = "right",
                 on = "article_title")

#Checking right # of rows
if len(final_merge_df) > len(ores_scores_df):
    print("WARNING - Some articles are missing quality information:")
    print(final_merge_df[final_merge_df.isnull().any(axis = 1)])
elif len(final_merge_df) < len(ores_scores_df):
    print("WARNING - there are more quality ratings than article titles")
    print(final_merge_df[final_merge_df.isnull().any(axis = 1)])
else:
    print("Successful merge, expected {0} rows and got {1}".format(
    len(ores_scores_df), len(final_merge_df)))

Successful merge, expected 21515 rows and got 21515
Successful merge, expected 21514 rows and got 21514
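
The difference between the left and right merges used above can be sketched on a three-article sample. The titles and revision ids come from the tables earlier in this notebook; the missing score for "Addison, Alabama" is a hypothetical gap introduced only to show the effect.

```python
import pandas as pd

sample_titles = pd.DataFrame({
    "article_title": ["Abbeville, Alabama", "Adamsville, Alabama",
                      "Addison, Alabama"],
    "revision_id": [1171163550, 1177621427, 1168359898],
})
# Hypothetical: one article has no quality score
sample_quality = pd.DataFrame({
    "article_title": ["Abbeville, Alabama", "Adamsville, Alabama"],
    "article_quality": ["C", "C"],
})

# A left merge keeps every title; the unscored article gets NaN
left_merged = pd.merge(sample_titles, sample_quality, how="left",
                       on="article_title")

# A right merge keeps only the rows that have a quality score
right_merged = pd.merge(sample_titles, sample_quality, how="right",
                        on="article_title")

print(len(left_merged), len(right_merged))
```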


Finally, we reorder the columns and save the final output to a CSV in clean_data.

In [106]:
#Reordering columns
wp_scored_city_articles_by_state = final_merge_df[['state',
       'regional_division', 'population', 'article_title',
       'revision_id', 'article_quality']]

#Saving to CSV
wp_scored_city_articles_by_state.to_csv("../clean_data/wp_scored_city_articles_by_state.csv")
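
Note that `to_csv` writes the dataframe index as an unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference using in-memory buffers rather than the real output path; whether the index is wanted in the saved file is a choice for the analysis, so this is an illustration and not a correction to the step above.

```python
import io
import pandas as pd

sample_out = pd.DataFrame({"state": ["Alabama"], "article_quality": ["C"]})

# Default behaviour: the index is written as an unnamed first column
buf_with_index = io.StringIO()
sample_out.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# index=False leaves the index out of the file
buf_no_index = io.StringIO()
sample_out.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_with_index, cols_no_index)
```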