# This notebook is the executable code that reads in the datasets, combines them, and runs the data analysis

The end goal is to analyze how the coverage of US cities on Wikipedia, and the quality of the articles about those cities, varies among states. 

Steps: 
1. Confirm datasets are ready to go.
1. Combine the dataset of Wikipedia articles with a dataset of state populations.
2. Use ORES to estimate the quality of articles about the cities.
3. Data analysis
   3a. The states with the greatest and least coverage of cities on Wikipedia compared  
       to their population.
   3b. The states with the highest and lowest proportion of high quality articles about cities.
   3c. A ranking of US geographic regions by articles-per-person and proportion of high 
       quality articles.



# Import packages

In [6]:
import numpy as np
import pandas as pd
import csv

# Part 1. Confirm datasets are ready to go. 

Confirm that the files below are there:  
./data/PopulationEstimates.csv   
./data/States_by_region.csv  
./data/us_cities_by_state_SEPT2023.csv  
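
A quick programmatic check for the three files could look like the sketch below. The helper name `find_missing_files` and its `data_dir` argument are illustrative assumptions, not part of the original notebook. The code cells in this notebook open the files from `../data/`, so that is the folder used in the example call; adjust it to wherever the data folder sits relative to the notebook.

```python
from pathlib import Path

# File names are taken from the list above.
EXPECTED_FILES = [
    "PopulationEstimates.csv",
    "States_by_region.csv",
    "us_cities_by_state_SEPT2023.csv",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Hypothetical usage: an empty list means every file was found.
missing = find_missing_files("../data")
print("Missing files:", missing if missing else "none")
```

Returning the list of missing names, rather than raising on the first one, makes it possible to report every missing file at once.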

#### Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
    ##### Step 1A1: Store only state and population data from csv into list for just states. 
    ##### Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1B: Read in ./data/States_by_region.csv as csv object.
    ##### Step 1B1: Store all data from csv into list. 
    ##### Step 1B2: Remove header row from list to prevent it showing up later.    
    ##### Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
    ##### Step 1C1: Store all data from csv into list. 
    ##### Step 1C2: Remove header row from list to prevent it showing up later.    
    ##### Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory. 

### Data sources: 
##### ./data/PopulationEstimates.csv  
https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/state/totals/NST-EST2022-ALLDATA.csv    
https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html  
##### ./data/States_by_region.csv    
The shared Google Drive for Homework 2 by UW Data 512 provides this file.   
##### ./data/us_cities_by_state_SEPT2023.csv  
The shared Google Drive for Homework 2 by UW Data 512 provides this file.   
The file was originally generated by crawling the Wikipedia Category:Lists of cities in the United States by state,   
which yields a list of Wikipedia article pages about US cities from each state.   


In [75]:
# Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
StatePopEstimates = []
with open('../data/PopulationEstimates.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1A1: Store only state and population data from csv into list for just states. 
    #  This means excluding the first 15 rows and the last row, which correspond to 
    #  the header row, larger regions of the United States, and Puerto Rico, which are not states. 
    #  row[4] holds the state name and row[8] holds the population estimate.
    for index, row in enumerate(reader): 
        if 14 < index < 66:
            StatePopEstimates.append([row[4], row[8]]) 
        
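# Step 1A1 (continued): Remove the District of Columbia, which is not a state.
#  The rows are in alphabetical order, so it sits at index 8.
StatePopEstimates.pop(8)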
        
# Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory.  
print("Number of states, should be 50:", len(StatePopEstimates))
print(StatePopEstimates[0])
print(StatePopEstimates[len(StatePopEstimates)//2])
print(StatePopEstimates[-1])

Number of states, should be 50: 50
['Alabama', '5074296']
['Montana', '1122867']
['Wyoming', '581381']
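
Selecting rows by position works for this file but breaks if the row order ever changes. A more defensive alternative, sketched below, selects rows by column name instead. It assumes the Census file has `SUMLEV`, `NAME`, and `POPESTIMATE2022` columns and that `SUMLEV` equal to `040` marks state-level rows; verify both against the actual header before relying on it. The function name `read_state_populations` is hypothetical, and the in-memory sample is only a stand-in for the real file: the Alabama and Wyoming values come from the output above, while the other population values are placeholders.

```python
import csv
import io

# Tiny stand-in for the Census file. Populations for Alabama and Wyoming match
# the notebook output; the zeros are placeholders, not real estimates.
SAMPLE_CSV = """SUMLEV,NAME,POPESTIMATE2022
010,United States,0
040,Alabama,5074296
040,District of Columbia,0
040,Wyoming,581381
040,Puerto Rico,0
"""

# State-level rows that are not among the 50 states.
NON_STATES = {"District of Columbia", "Puerto Rico"}


def read_state_populations(csvfile, pop_column="POPESTIMATE2022"):
    """Return [name, population] pairs for rows that describe one of the 50 states."""
    reader = csv.DictReader(csvfile)
    return [
        [row["NAME"], row[pop_column]]
        for row in reader
        if row["SUMLEV"] == "040" and row["NAME"] not in NON_STATES
    ]


states = read_state_populations(io.StringIO(SAMPLE_CSV))
print(states)
```

With the real file, the same function could be called on the handle returned by `open(...)` in place of the `io.StringIO` object.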


In [76]:
# Step 1B: Read in ./data/States_by_region.csv as csv object.
StateRegions = []
with open('../data/States_by_region.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1B1: Store all data from csv into list. 
    for row in reader: 
        StateRegions.append([row[0], row[1], row[2]]) 

# Step 1B2: Remove header row
StateRegions.pop(0)

# Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory.
print("Number of states, should be 50:", len(StateRegions))
print(StateRegions[0])
print(StateRegions[len(StateRegions)//2])
print(StateRegions[-1])

Number of states, should be 50: 50
['Northeast', 'New England', 'Connecticut']
['South', 'South Atlantic', 'North Carolina']
['West', 'Pacific', 'Washington']
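
Both lists report 50 states, but matching counts do not guarantee matching names. A cross-check along the following lines could catch spelling differences before the datasets are combined. The helper `compare_state_names` is a hypothetical name introduced for illustration, and the sample rows are deliberately mismatched to show what a discrepancy looks like; the row layouts follow the outputs above (`[state, population]` and `[region, division, state]`).

```python
def compare_state_names(pop_rows, region_rows):
    """Return (names only in pop_rows, names only in region_rows), each sorted."""
    pop_states = {row[0] for row in pop_rows}        # [state, population]
    region_states = {row[2] for row in region_rows}  # [region, division, state]
    only_in_pop = sorted(pop_states - region_states)
    only_in_regions = sorted(region_states - pop_states)
    return only_in_pop, only_in_regions


# Illustrative sample rows, intentionally inconsistent with each other.
sample_pop = [["Alabama", "5074296"], ["Wyoming", "581381"]]
sample_regions = [
    ["South", "East South Central", "Alabama"],
    ["West", "Pacific", "Washington"],
]

only_in_pop, only_in_regions = compare_state_names(sample_pop, sample_regions)
print("Only in population data:", only_in_pop)
print("Only in region data:", only_in_regions)
```

On the full lists, both results should be empty if the two files agree on state names.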


In [77]:
# Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
CityArticles = []
with open('../data/us_cities_by_state_SEPT2023.csv', newline='', encoding="utf8") as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1C1: Store all data from csv into list. 
    for row in reader: 
        CityArticles.append([row[0], row[1], row[2]]) 

# Step 1C2: Remove header row
CityArticles.pop(0)

# Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory.
print(CityArticles[0])
print(CityArticles[len(CityArticles)//2])
print(CityArticles[-1])

['Alabama', 'Abbeville, Alabama', 'https://en.wikipedia.org/wiki/Abbeville,_Alabama']
['Minnesota', 'Sargeant, Minnesota', 'https://en.wikipedia.org/wiki/Sargeant,_Minnesota']
['Wyoming', 'Yoder, Wyoming', 'https://en.wikipedia.org/wiki/Yoder,_Wyoming']
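
With the city and population lists in memory, the articles-per-person figure needed for the later analysis can be derived by counting articles per state and dividing by population. The sketch below shows one possible way to do this with the standard library; it is not necessarily how the later parts of this notebook compute it. The function name `articles_per_capita` is hypothetical, and the sample rows are the ones printed in the outputs above.

```python
from collections import Counter


def articles_per_capita(city_rows, pop_rows):
    """Map each state in pop_rows to (number of city articles) / (population)."""
    # city_rows follow the [state, article title, url] layout.
    counts = Counter(row[0] for row in city_rows)
    # pop_rows follow the [state, population] layout; populations are strings.
    return {
        state: counts.get(state, 0) / int(population)
        for state, population in pop_rows
    }


# Sample rows copied from the outputs above.
city_rows = [
    ["Alabama", "Abbeville, Alabama", "https://en.wikipedia.org/wiki/Abbeville,_Alabama"],
    ["Minnesota", "Sargeant, Minnesota", "https://en.wikipedia.org/wiki/Sargeant,_Minnesota"],
    ["Wyoming", "Yoder, Wyoming", "https://en.wikipedia.org/wiki/Yoder,_Wyoming"],
]
pop_rows = [["Alabama", "5074296"], ["Wyoming", "581381"]]

ratios = articles_per_capita(city_rows, pop_rows)
for state, ratio in ratios.items():
    print(state, ratio)
```

States that appear in the city list but not in the population list (Minnesota in this sample) are skipped, and states with no articles get a ratio of zero rather than being dropped.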


# Part 2. Get Article Quality Predictions