___

<a href='https://github.com/eliasmelul/'> <img src='https://s3.us-east-2.amazonaws.com/wordontheamazon.com/NoMargin_NewLogo.png' style='width: 15em;' align='right' /></a>
# Finding my Schitt's Creek
#### Data Collection: General Socioeconomic Data
___
<h3 align="right">by Elias Melul, Data Scientist </h3> 

___



## Collecting General Socioeconomic Data

---
**GOOD NEWS!** If you want to download this data, you don't need to rerun the notebook and scrape the web. I've made the data available for you **<a href='https://github.com/eliasmelul/finding_schitts/blob/master/Data/scraped_datausa.csv'>here</a>**.

Want direct access? **<a href='https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Data/scraped_datausa.csv'>Click Here</a>**

----


## Data Collection

---

Since we've already collected the weather information, we are going to use said dataset to generate URLs that link us to the datausa.io paths for each city.


----
**What data will be collected and how?**

There is a CRAZY amount of data available on the web about all cities, but it can be very disparate and spread across numerous websites. Lucky me, I found [this](https://datausa.io/) website that contains a lot of the information I could use (thank you Deloitte and Datawheel!). I will then combine this information with weather information and FourSquare information about venues to complete the dataframe and begin modeling!

**Some of the variables scraped are:**

* Population and Population Change (Year to Year)
* Poverty Rate
* Median Age
* Median Household Income and Median Household Income Change (Year to Year)
* Number of Employees and Number of Employees Change (Year to Year)
* Median Property Value and Median Property Value Change (Year to Year)
* Average Male and Female Salary, and a ratio of Average Male to Female Salary
* Gini coefficient in 2017 and 2018, as well as its change (Year to Year)
* Ratio of Patients to Clinicians (county-wise)
* Foreign-born population percentage
* Citizen population percentage
* Total degrees awarded in 2018 (higher education)
* Male to Female ratio of awarded degrees
* Number of degrees per capita
* Number of households in city
* Population per household (people per household)
* Homeownership Percentage (Rent vs Own)
* Average Commute Time (minutes)



#### Import Libraries

---
The libraries imported are not all used in this notebook. However, to be able to follow the whole project, please make sure you have these installed!

Otherwise use pip or conda to install them on your laptop or computer.

In [1]:
import pandas as pd # import pandas for dataframes
import numpy as np
import requests
from bs4 import BeautifulSoup
import locale
from datetime import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns
import json
import folium
from pandas import json_normalize # top-level import; the pandas.io.json path is deprecated since pandas 1.0


import matplotlib.cm as cm
import matplotlib.colors as colors

from IPython.display import HTML, Image, display

pd.options.display.max_columns = None
pd.options.display.max_rows=None


%matplotlib inline

**Import weather dataset**

---

Import weather data to generate URLs and scrape datausa.io

In [2]:
weatherCity_df = pd.read_csv('https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Data/final_weather_data.csv', index_col=0)
weatherCity_df.head()

Unnamed: 0,State,City,State_Abb,CityState,Jan High,Feb High,Mar High,Apr High,May High,Jun High,Jul High,Aug High,Sep High,Oct High,Nov High,Dec High,Jan Low,Feb Low,Mar Low,Apr Low,May Low,Jun Low,Jul Low,Aug Low,Sep Low,Oct Low,Nov Low,Dec Low,Jan Prec,Feb Prec,Mar Prec,Apr Prec,May Prec,Jun Prec,Jul Prec,Aug Prec,Sep Prec,Oct Prec,Nov Prec,Dec Prec,Jan Snow,Feb Snow,Mar Snow,Apr Snow,May Snow,Jun Snow,Jul Snow,Aug Snow,Sep Snow,Oct Snow,Nov Snow,Dec Snow
0,Alabama,Addison,AL,"Addison, AL",50.0,54.0,63.0,71.0,79.0,85.0,89.0,89.0,83.0,73.0,63.0,52.0,29.0,33.0,40.0,48.0,56.0,64.0,68.0,67.0,60.0,48.0,40.0,32.0,5.03,5.29,5.25,4.78,5.48,4.83,4.83,3.75,4.27,3.77,5.05,5.91,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alabama,Alabaster,AL,"Alabaster, AL",54.0,59.0,67.0,74.0,81.0,87.0,91.0,90.0,85.0,75.0,66.0,56.0,34.0,38.0,44.0,51.0,60.0,68.0,71.0,70.0,64.0,53.0,44.0,37.0,5.33,5.75,4.48,3.89,4.48,4.63,4.76,4.76,3.83,3.02,4.95,4.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Alabama,Alexander City,AL,"Alexander City, AL",55.0,59.0,68.0,75.0,82.0,88.0,91.0,90.0,85.0,76.0,67.0,57.0,32.0,35.0,41.0,48.0,57.0,66.0,69.0,68.0,62.0,50.0,41.0,34.0,5.21,5.35,5.49,4.11,4.33,4.45,5.31,4.5,4.1,3.08,4.79,4.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alabama,Aliceville,AL,"Aliceville, AL",56.0,60.0,69.0,77.0,84.0,90.0,92.0,92.0,87.0,78.0,68.0,58.0,31.0,34.0,42.0,49.0,57.0,66.0,70.0,69.0,62.0,50.0,41.0,33.0,5.24,5.59,5.27,4.86,4.6,4.88,4.99,3.57,3.76,3.97,4.84,4.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alabama,Andalusia,AL,"Andalusia, AL",61.0,65.0,72.0,78.0,85.0,90.0,92.0,91.0,88.0,80.0,72.0,63.0,32.0,35.0,41.0,47.0,56.0,65.0,67.0,68.0,62.0,50.0,40.0,35.0,5.18,5.3,6.35,4.17,3.89,5.3,6.16,5.65,4.62,3.71,4.73,4.91,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Generate a list of URLs that link us to datausa.io's path to each city**

---

For that, we will need a list of all the cities and their respective state abbreviation. Why? Because the path on the website we are scraping is as follows:

    /city-name-state_abbreviation
    
    For example:
    /new-york-ny

---

However, not all URLs are that clear. Some contain paths that include 'metropolitan area' or some require a 'county'. In this case, we will generate URLs that also include the following form:

    /new-york-ny-metro-area

Once we generate these URLs, we will check if they are valid, extract another list of only valid URLs and use those to scrape data.

In [3]:
cities_list = weatherCity_df['City']+ "-"+weatherCity_df.State_Abb
cities_list = list(cities_list)
cities_list= [i.lower().replace(' ','-') for i in cities_list]

# Also generate the metro-area variant of every path
cities_metro = [i+'-metro-area' for i in cities_list]
cities_list = cities_list+cities_metro

# Prepend the domain to each generated path to build the full URLs
urls_cities = ["https://datausa.io/profile/geo/"+i for i in cities_list]
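
The URL pattern described above can also be wrapped in a small helper, which makes it easy to check a single city by hand. This is only a sketch: the function name `candidate_urls` and the constant `BASE_URL` are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: builds both URL variants for one city.
BASE_URL = "https://datausa.io/profile/geo/"

def candidate_urls(city, state_abbr):
    """Return the plain and the metro-area profile URL for a city."""
    # Lowercase the "City-ST" string and replace spaces with hyphens
    slug = f"{city}-{state_abbr}".lower().replace(" ", "-")
    return [BASE_URL + slug, BASE_URL + slug + "-metro-area"]

print(candidate_urls("New York", "NY"))
```

Both variants still need to be validated, since only some of them point to an existing profile page.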

**Checking the validity of each URL**

---

To prevent computational complications we decided to separate the URL validation and scraping tasks. However, combining these steps simply requires us to move the scraping functions prior to the next loop and include them in the calls where _exist = 1_.

In [25]:
# Now that we have all URLs, we should check whether these URLs are valid
#Check if url connections exist
checker = []
for url in urls_cities:
    c = requests.get(url)
    soup = BeautifulSoup(c.content, features='lxml')
    res = soup.find_all("h4",{"class":"pt-non-ideal-state-title"})
    if (len(res) > 0):
        exist = 0
    else:
        try:
            tes = soup.find("div",{"class":"content-container"})
            tes = tes.get_text()
            if (tes == 'N/A'):
                exist = 0
            else:
                exist =1
        except AttributeError:
            # No content-container div was found (soup.find returned None); treat the page as valid
            exist=1
           
    tempdic = {"URL":url,
              "Exist":exist}
    checker.append(tempdic)

checker = pd.DataFrame(checker)
checker.head()

Unnamed: 0,URL,Exist
0,https://datausa.io/profile/geo/addison-al,1
1,https://datausa.io/profile/geo/alabaster-al,1
2,https://datausa.io/profile/geo/alexander-city-al,1
3,https://datausa.io/profile/geo/aliceville-al,1
4,https://datausa.io/profile/geo/andalusia-al,1


In [26]:
existing = checker[checker.Exist==1]
print(f"The number of valid URLs is: {existing.shape[0]}")
nonexisting = checker[checker.Exist==0]
print(f"The number of invalid URLs is: {nonexisting.shape[0]}")

The number of valid URLs is: 10958
The number of invalid URLs is: 968


While the number of valid URLs is much larger than the city count we have, some will be duplicates due to our URL generation patterns. We will deal with this issue later on.

In [27]:
#Create a list with the existing
urls_cities = existing['URL'].to_list()

### Scraping dataUSA

For this section, we have defined everything as functions, so that we can easily iterate over all the URLs. 

---
The first function simply retrieves the text data from the HTML section specified by the class types and names.

---

The second function uses the first to retrieve each desired metric for a city as a string and converts it to the appropriate type. In this case, every value is numeric except the name of the city. The function returns a dictionary.

In [4]:
def get_text_from_class(soup,classname,type1="div",type2="class"):
    city_content = soup.find(type1, {type2:classname})
    city_info = city_content.find_all('p')
    content_raw = []
    for i in city_info:
        i = str(i)
        removefirst = i[3:-4]
        content_raw.append(removefirst)
    return(content_raw)

In [5]:
def get_basic_city_info(url):
    html = requests.get(url=url)
    soup = BeautifulSoup(html.content, features='lxml')
    
    ###### Get content from dashboard
    #Name of the city
    city_name = soup.find("p").get_text()
    
    #Raw dashboard content
    content_raw = get_text_from_class(soup, "profile-stats")
    
    #Population of City
    Population = content_raw[1]
    
    if Population[-1:] == "M":
        Population = float(Population[0:-1])*1000000
    else:
        Population = float(Population.replace(',',''))
    
    #Population Change YTY
    Population_Change = content_raw[2]
    if Population_Change[-7:] == 'decline':
        Population_Change = float(Population_Change[0:-9])*-1
    else:
        Population_Change = float(Population_Change[0:-8])
        
    #Poverty Rate
    Poverty_Rate = float(content_raw[4][0:-1])
    
    #Median Age
    Median_Age = float(content_raw[7])
    
    #Median Household Income and Change
    Median_Household_Income = float(content_raw[10][1:].replace(',',''))
    Median_Household_Income_Change = content_raw[11]
    if Median_Household_Income_Change[-7:]=='decline':
        Median_Household_Income_Change = float(Median_Household_Income_Change[0:-9])*-1
    else:
        Median_Household_Income_Change = float(Median_Household_Income_Change[0:-8])
    
    #Number of Employees
    Number_Employees = content_raw[13]
    if Number_Employees[-1:] == "M":
        Number_Employees = float(Number_Employees[0:-1])*1000000
    else:
        Number_Employees = float(Number_Employees.replace(',',''))
    
    #Change Number of Employees
    Number_Employees_Change = content_raw[14]
    if Number_Employees_Change[-7:]=='decline':
        Number_Employees_Change = float(Number_Employees_Change[0:-9])*-1
    else:
        Number_Employees_Change = float(Number_Employees_Change[0:-8])
    
    #Median Property Value
    Median_Property_Value = content_raw[16][1:]
    if Median_Property_Value[-1:] == "M":
        Median_Property_Value = float(Median_Property_Value[0:-1])*1000000
    else:
        Median_Property_Value = float(Median_Property_Value.replace(',',''))
    
    #Median Property Value Change
    Median_Property_Value_Change = content_raw[17]
    if Median_Property_Value_Change[-7:]=='decline':
        Median_Property_Value_Change = float(Median_Property_Value_Change[0:-9])*-1
    else:
        Median_Property_Value_Change = float(Median_Property_Value_Change[0:-8])
    
    #Wage Distribution across genders
    wage_gender = get_text_from_class(soup, "topic income_gender TextViz")
    avg_male_salary = float(wage_gender[2][1:].replace(',',''))
    avg_female_salary = float(wage_gender[5][1:].replace(',',''))
    gender_salary_ratio = avg_male_salary/avg_female_salary
  
    #Gini coefficient distribution
    gini_coeff = get_text_from_class(soup, "topic income_distro TextViz")
    gini_2018 = float(gini_coeff[2])
    gini_2017 = float(gini_coeff[4])
    gini_change_percent = (gini_2018-gini_2017)*100/gini_2017
    
    #Health Ratio (Patients to Clinicians)
    health_ratio = get_text_from_class(soup, "topic clinician_patient_ratio TextViz")
    health_ratio = float(health_ratio[2].replace(' to 1','').replace(',',''))
    
    #Foreign Born Population
    foreign_ratio = get_text_from_class(soup, "topic foreign_born TextViz")
    foreign_ratio = float(foreign_ratio[1][:-1].replace(',',''))
    
    #Citizen Ratio
    citizen_ratio = get_text_from_class(soup, "topic citizenship TextViz")
    citizen_ratio = float(citizen_ratio[1][:-1].replace(',',''))
    
    #Degrees awarded
    # There seem to be some cities without education information... we need to handle this exception
    try:
        degrees_awarded = get_text_from_class(soup, "topic edu_gender TextViz")
        degrees_men = float(degrees_awarded[1].replace(',',''))
        degrees_women = float(degrees_awarded[3].replace(',',''))
        degrees_gender_ratio_M2F = degrees_men/degrees_women 
        total_degrees = degrees_men+degrees_women
        degrees_per_capita = total_degrees/Population
    except:
        total_degrees = 0
        degrees_gender_ratio_M2F = 1
        degrees_per_capita = None
    
    #Number of Households
    households = get_text_from_class(soup, "topic household_income TextViz")
    households = households[5]
    if households[-1:] == "M":
        households = float(households[0:-1])*1000000
    elif households[-1:]=="k":
        households = float(households[0:-1])*1000
    else:
        households = float(households.replace(',',''))
    people_per_house = Population/households
    
    #Rent vs Ownership of Homes
    rent_own = get_text_from_class(soup, "topic rent_own TextViz")
    rent_own = float(rent_own[1][:-1])
    
    #Commute time
    commute_time = get_text_from_class(soup, "topic commute_time TextViz")
    commute_time = float(commute_time[1][:-8])
    
    basic_information = {'City':city_name,
                        'Population':Population,
                        'Population Change':Population_Change,
                        'Poverty Rate':Poverty_Rate,
                        'Median Age':Median_Age,
                        'Median Household Income':Median_Household_Income,
                        'Median Household Income Change':Median_Household_Income_Change,
                        'Number Employees':Number_Employees,
                        'Number Employees Change':Number_Employees_Change,
                        'Median Property Value':Median_Property_Value,
                        'Median Property Value Change':Median_Property_Value_Change,
                        'Average Male Salary':avg_male_salary,
                        'Average Female Salary':avg_female_salary,
                        'Gender Salary Ratio M2F':gender_salary_ratio,
                        'Gini 2018':gini_2018,
                        'Gini Change':gini_change_percent,
                        'Patient to Clinician Ratio':health_ratio,
                        'Foreign Born Population Ratio':foreign_ratio,
                        'Citizens Percentage':citizen_ratio,
                        'Total Degrees':total_degrees,
                        'Degrees Ratio M2F':degrees_gender_ratio_M2F,
                        'Degrees per Capita':degrees_per_capita,
                        'Households':households,
                        'People Per House':people_per_house,
                        'Homeownership':rent_own,
                        'Commute Time':commute_time}
    
    
    return basic_information
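
The function above repeats two conversions several times: numbers with an optional `k`/`M` suffix, and year-to-year changes that end in "growth" or "decline". As a hedged sketch, these could be factored into two helpers. The names `parse_abbreviated_number` and `parse_yoy_change` are hypothetical, and the assumed input format ("<number>% growth" or "<number>% decline") is inferred from the slice lengths used in the original code, not confirmed against the live site.

```python
def parse_abbreviated_number(text):
    """Convert strings such as '1.5M', '202k', '$220,500' or '32,567' to float."""
    # Drop surrounding whitespace, a leading dollar sign and thousands separators
    cleaned = text.strip().lstrip("$").replace(",", "")
    multipliers = {"k": 1_000, "M": 1_000_000}
    suffix = cleaned[-1:]
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)

def parse_yoy_change(text):
    """Convert '0.923% growth' to 0.923 and '0.716% decline' to -0.716."""
    # Assumed format: '<number>% growth' or '<number>% decline'
    value, _, direction = text.strip().partition("%")
    sign = -1.0 if direction.strip().lower() == "decline" else 1.0
    return sign * float(value)

print(parse_abbreviated_number("202k"), parse_yoy_change("0.716% decline"))
```

With helpers like these, each metric in `get_basic_city_info` would reduce to a single call on the relevant `content_raw` entry.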

##### Functions Use Example
---

We will use the Lancaster, PA metro area to see how the functions work and how the output looks.

In [18]:
url = 'https://datausa.io/profile/geo/lancaster-pa-metro-area'
get_basic_city_info(url)

{'City': 'Lancaster, PA',
 'Population': 543557.0,
 'Population Change': 0.12,
 'Poverty Rate': 10.4,
 'Median Age': 38.7,
 'Median Household Income': 66277.0,
 'Median Household Income Change': 4.53,
 'Number Employees': 271712.0,
 'Number Employees Change': -1.2,
 'Median Property Value': 220500.0,
 'Median Property Value Change': 7.98,
 'Average Male Salary': 69779.0,
 'Average Female Salary': 52119.0,
 'Gender Salary Ratio M2F': 1.3388399623937528,
 'Gini 2018': 0.469,
 'Gini Change': -0.6355932203389837,
 'Patient to Clinician Ratio': 1232.0,
 'Foreign Born Population Ratio': 4.83,
 'Citizens Percentage': 97.4,
 'Total Degrees': 4990.0,
 'Degrees Ratio M2F': 0.6990125978890024,
 'Degrees per Capita': 0.009180269962487836,
 'Households': 202000.0,
 'People Per House': 2.6908762376237623,
 'Homeownership': 67.4,
 'Commute Time': 21.4}

**Now that we have defined these functions, let's iterate over all our valid URLs to extract the specified metrics**

---

In [21]:
cities_basic = []
for url in urls_cities:
    try:
        data = get_basic_city_info(url)
        cities_basic.append(data)
    except:
        try:
            data = get_basic_city_info(url)
            cities_basic.append(data)
        except:
            try:
                data = get_basic_city_info(url)
                cities_basic.append(data)
            except:
                pass

cities = pd.DataFrame(cities_basic)
print(cities.shape)
cities.head()

(4233, 26)


Unnamed: 0,City,Population,Population Change,Poverty Rate,Median Age,Median Household Income,Median Household Income Change,Number Employees,Number Employees Change,Median Property Value,Median Property Value Change,Average Male Salary,Average Female Salary,Gender Salary Ratio M2F,Gini 2018,Gini Change,Patient to Clinician Ratio,Foreign Born Population Ratio,Citizens Percentage,Total Degrees,Degrees Ratio M2F,Degrees per Capita,Households,People Per House,Homeownership,Commute Time
0,"Alabaster, AL",32567.0,0.923,11.1,37.0,74383.0,1.44,15674.0,-0.716,167800.0,0.902,58549.0,40615.0,1.441561,0.46,0.436681,1166.0,6.13,94.9,3.0,0.0,9.2e-05,10800.0,3.015463,82.7,28.7
1,"Alexander City, AL",40756.0,-0.493,21.2,43.3,42181.0,7.07,15317.0,1.45,107000.0,1.71,58549.0,40615.0,1.441561,0.46,0.436681,1535.0,1.35,99.2,350.0,0.612903,0.008588,16400.0,2.485122,71.9,24.5
2,"Aliceville, AL",2466.0,-23.4,34.6,36.3,24097.0,20.9,810.0,-18.6,102800.0,5.44,58549.0,40615.0,1.441561,0.46,0.436681,2981.0,0.243,99.8,0.0,1.0,,928.0,2.657328,48.4,21.1
3,"Andalusia, AL",8918.0,-1.37,17.3,41.1,36101.0,6.14,3629.0,7.46,110400.0,12.8,58549.0,40615.0,1.441561,0.46,0.436681,1991.0,1.18,99.7,514.0,0.903704,0.057636,3610.0,2.47036,61.3,15.8
4,"Ashland, AL",2113.0,-5.29,24.6,45.7,27264.0,-12.2,868.0,1.64,133000.0,48.4,58549.0,40615.0,1.441561,0.46,0.436681,3389.0,2.98,97.0,0.0,1.0,,895.0,2.360894,50.8,24.1
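
The nested try/except blocks in the scraping loop amount to "try each URL up to three times, then skip it". A minimal sketch of the same idea as a reusable helper follows. The name `call_with_retries` is hypothetical, and the pause between attempts is an assumption added here; the original loop retries immediately.

```python
import time

def call_with_retries(func, *args, attempts=3, delay=1.0):
    """Call func(*args); return its result, or None if every attempt raises."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            # Wait before the next attempt, but not after the last one
            if attempt < attempts - 1:
                time.sleep(delay)
    return None

# Tiny demonstration with a function that fails twice before succeeding
calls = []

def flaky(x):
    calls.append(x)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return x * 2

print(call_with_retries(flaky, 21, delay=0))
```

In the notebook, the loop body would then become `data = call_with_retries(get_basic_city_info, url)`, appending `data` only when it is not `None`.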


When generating the URLs, we generalized to include paths like _/new-york-ny-metro-area_. This may cause an issue: duplicated cities in the dataset. However, the reason why we included this in the first place was to match the general city people know about rather than specific parts of it or boroughs. For example, in DATAUSA, _Lancaster, PA_ is not the metropolitan area that people typically think about (if you ever think about Lancaster), but rather the city/town of Lancaster. 

Therefore, we must delete all duplicated records and keep those that have the largest populations.

---
**Deleting Duplicates**

---

By sorting the data by population in descending order and resetting the index, we can remove duplicate rows based on their position. In other words, we keep the first record of each duplicated city, which is most likely the metropolitan-area one.

In [28]:
dupli = cities[cities['City'].duplicated(keep=False)]
print(f"Shape of dataframe with duplicates: {dupli.shape}")
dupli = dupli.sort_values(by='Population', ascending=False)
dupli = dupli.reset_index().drop('index', axis=1)
dupli = dupli.drop_duplicates(subset='City',keep='first')
print(f"Shape after dropping duplicates: {dupli.shape}")

cities = cities.drop_duplicates(subset='City', keep=False)
print(f"Shape of complete dataset after dropping all duplicated rows, not just one: {cities.shape}")

cities = pd.concat([cities,dupli])
print(f"The final dataframe has shape: {cities.shape}")
dupli.head()

Shape of dataframe with duplicates: (0, 26)
Shape after dropping duplicates: (0, 26)
Shape of complete dataset after dropping all duplicated rows, not just one: (4050, 26)
The final dataframe has shape: (4050, 26)


Unnamed: 0,City,Population,Population Change,Poverty Rate,Median Age,Median Household Income,Median Household Income Change,Number Employees,Number Employees Change,Median Property Value,Median Property Value Change,Average Male Salary,Average Female Salary,Gender Salary Ratio M2F,Gini 2018,Gini Change,Patient to Clinician Ratio,Foreign Born Population Ratio,Citizens Percentage,Total Degrees,Degrees Ratio M2F,Degrees per Capita,Households,People Per House,Homeownership,Commute Time
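
The same rule (keep the highest-population row for each city name) can also be expressed in one pass over the whole dataframe, without splitting out the duplicates first. The sketch below uses a tiny made-up dataframe; the population figures are illustrative only and the variable names `toy` and `deduped` are hypothetical.

```python
import pandas as pd

# Illustrative data: two records share the same city name
toy = pd.DataFrame({
    "City": ["Lancaster, PA", "Lancaster, PA", "Akron, OH"],
    "Population": [59000.0, 543557.0, 198252.0],
})

# Sort so the largest population comes first, then keep the first row per city
deduped = (
    toy.sort_values("Population", ascending=False)
       .drop_duplicates(subset="City", keep="first")
       .reset_index(drop=True)
)
print(deduped)
```

Because cities that are not duplicated are untouched by `drop_duplicates`, no separate concatenation step is needed with this approach.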


In [30]:
cities.head()

Unnamed: 0,City,Population,Population Change,Poverty Rate,Median Age,Median Household Income,Median Household Income Change,Number Employees,Number Employees Change,Median Property Value,Median Property Value Change,Average Male Salary,Average Female Salary,Gender Salary Ratio M2F,Gini 2018,Gini Change,Patient to Clinician Ratio,Foreign Born Population Ratio,Citizens Percentage,Total Degrees,Degrees Ratio M2F,Degrees per Capita,Households,People Per House,Homeownership,Commute Time
0,"Alabaster, AL",32567.0,0.923,11.1,37.0,74383.0,1.44,15674.0,-0.716,167800.0,0.902,58549.0,40615.0,1.441561,0.46,0.436681,1166.0,6.13,94.9,3.0,0.0,9.2e-05,10800.0,3.015463,82.7,28.7
1,"Alexander City, AL",40756.0,-0.493,21.2,43.3,42181.0,7.07,15317.0,1.45,107000.0,1.71,58549.0,40615.0,1.441561,0.46,0.436681,1535.0,1.35,99.2,350.0,0.612903,0.008588,16400.0,2.485122,71.9,24.5
2,"Aliceville, AL",2466.0,-23.4,34.6,36.3,24097.0,20.9,810.0,-18.6,102800.0,5.44,58549.0,40615.0,1.441561,0.46,0.436681,2981.0,0.243,99.8,0.0,1.0,,928.0,2.657328,48.4,21.1
3,"Andalusia, AL",8918.0,-1.37,17.3,41.1,36101.0,6.14,3629.0,7.46,110400.0,12.8,58549.0,40615.0,1.441561,0.46,0.436681,1991.0,1.18,99.7,514.0,0.903704,0.057636,3610.0,2.47036,61.3,15.8
4,"Ashland, AL",2113.0,-5.29,24.6,45.7,27264.0,-12.2,868.0,1.64,133000.0,48.4,58549.0,40615.0,1.441561,0.46,0.436681,3389.0,2.98,97.0,0.0,1.0,,895.0,2.360894,50.8,24.1


Awesome! We now have some general socioeconomic data on 4050 cities.

While there were more cities in the weather dataset, due to scraping complications and error handling, some cities did not compute and are not included in the dataset. Nonetheless, these are smaller towns that should not affect us much in our project.

---

However, numerous large cities are defined as a group of cities, so we must append and correct these first!

## Other Cities Scrape

## Creating list of URLs from Directory

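While this driver does not behave consistently (a search page sometimes comes back with "No Results Found" and has to be loaded a second time), the results shown should be reproducible and cover all the necessary additions to our cities list! 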

In [82]:
abc = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
abc_url_dir = ["https://datausa.io/search/?q="+i+"&dimension=Geography&hierarchy=MSA" for i in abc]
from selenium import webdriver
from time import sleep
city_names = []

for url in abc_url_dir:
    driver = webdriver.Chrome(r"C:\Users\melul\Downloads\chromedriver.exe")
    sleep(2)
    driver.get(url)
    html=driver.page_source
    soup = BeautifulSoup(html, features='lxml')
    findings = soup.find_all('li')
    cities = [i.get_text() for i in findings]
    cities = cities[:-12]
    print(cities)
    if (cities[0] == "No Results Found"):
        driver = webdriver.Chrome(r"C:\Users\melul\Downloads\chromedriver.exe")
        sleep(2)
        driver.get(url)
        html=driver.page_source
        soup = BeautifulSoup(html, features='lxml')
        findings = soup.find_all('li')
        cities = [i.get_text() for i in findings]
        cities = cities[:-12]
        print(cities)
    city_names.append(cities)


['Los Angeles-Long Beach-Anaheim, CAMSA', 'Washington-Arlington-Alexandria, DC-VA-MD-WVMSA', 'New York-Newark-Jersey City, NY-NJ-PAMSA', 'Dallas-Fort Worth-Arlington, TXMSA', 'Atlanta-Sandy Springs-Roswell, GAMSA', 'Sacramento--Roseville--Arden-Arcade, CAMSA', 'Phoenix-Mesa-Scottsdale, AZMSA', 'Chicago-Naperville-Elgin, IL-IN-WIMSA', 'Denver-Aurora-Lakewood, COMSA', 'Houston-The Woodlands-Sugar Land, TXMSA', 'San Antonio-New Braunfels, TXMSA', 'Akron, OHMSA', 'Indianapolis-Carmel-Anderson, INMSA', 'Philadelphia-Camden-Wilmington, PA-NJ-DE-MDMSA', 'Miami-Fort Lauderdale-West Palm Beach, FLMSA', 'Austin-Round Rock, TXMSA', 'Milwaukee-Waukesha-West Allis, WIMSA', 'Albuquerque, NMMSA', 'Boston-Cambridge-Newton, MA-NHMSA', 'San Francisco-Oakland-Hayward, CAMSA', 'Riverside-San Bernardino-Ontario, CAMSA', 'Detroit-Warren-Dearborn, MIMSA', 'Anchorage, AKMSA', 'Seattle-Tacoma-Bellevue, WAMSA', 'Ann Arbor, MIMSA', 'Minneapolis-St. Paul-Bloomington, MN-WIMSA', 'San Diego-Carlsbad, CAMSA', 'Ashev

In [7]:
city_names_list = []
for i in range(len(city_names)):
    for j in range(len(city_names[i])):
        city_names_list.append(city_names[i][j][:-3])

In [8]:
new_city_url = [i.lower().replace("--","-").replace(",","").replace(" ","-") for i in city_names_list]
new_city_metro = [i+"-metro-area" for i in new_city_url]
newCity_urls = new_city_url+new_city_metro
newCity_urls = ["https://datausa.io/profile/geo/"+i for i in newCity_urls]
newCity_urls = list(dict.fromkeys(newCity_urls))
len(newCity_urls)

632

In [9]:
other_cities = []
for url in newCity_urls:
    try:
        data = get_basic_city_info(url)
        other_cities.append(data)
    except:
        try:
            data = get_basic_city_info(url)
            other_cities.append(data)
        except:
            try:
                data = get_basic_city_info(url)
                other_cities.append(data)
            except:
                pass

otherCities_df = pd.DataFrame(other_cities)
print(otherCities_df.shape)
otherCities_df.head()

(404, 26)


Unnamed: 0,City,Population,Population Change,Poverty Rate,Median Age,Median Household Income,Median Household Income Change,Number Employees,Number Employees Change,Median Property Value,Median Property Value Change,Average Male Salary,Average Female Salary,Gender Salary Ratio M2F,Gini 2018,Gini Change,Patient to Clinician Ratio,Foreign Born Population Ratio,Citizens Percentage,Total Degrees,Degrees Ratio M2F,Degrees per Capita,Households,People Per House,Homeownership,Commute Time
0,"Akron, OH",198252.0,-0.129,24.1,36.7,36223.0,2.79,90760.0,2.27,80100.0,0.125,61819.0,45809.0,1.349495,0.461,0.0,1025.0,5.88,96.0,5648.0,0.986634,0.028489,84400.0,2.348957,51.0,21.1
1,"Albuquerque, NM",560234.0,0.3,18.2,37.2,51099.0,1.27,271351.0,2.4,207300.0,5.28,55599.0,42679.0,1.302725,0.465,2.876106,972.0,9.89,94.5,21010.0,0.592028,0.037502,228000.0,2.457167,58.3,21.1
2,"Anchorage, AK",291538.0,-0.957,8.09,34.3,83648.0,5.66,144628.0,-3.06,321300.0,0.406,70907.0,57837.0,1.22598,0.455,2.247191,931.0,10.7,96.1,2909.0,0.589617,0.009978,105000.0,2.776552,64.2,18.7
3,"Ann Arbor, MI",119303.0,1.03,22.1,27.5,61247.0,6.15,61445.0,2.97,271600.0,8.55,63869.0,46762.0,1.365831,0.476,-0.209644,563.0,18.6,88.4,14233.0,1.013724,0.119301,47500.0,2.511642,45.9,18.9
4,"Asheville, NC",89318.0,2.04,15.1,38.6,46464.0,3.38,46512.0,4.95,227500.0,7.31,59982.0,44729.0,1.341009,0.473,0.424628,707.0,7.09,95.8,2379.0,0.71769,0.026635,39700.0,2.249824,49.4,16.1


In [10]:
for i, row in otherCities_df.iterrows():
    otherCities_df.at[i,'CityName'] = row['City'][:row['City'].find(",")]
for i, row in otherCities_df.iterrows():
    ind = row['CityName'].find("-")
    if (ind == -1):
        ind = len(row['CityName'])
        ten = 0
    else:
        ten = 1
    otherCities_df.at[i,'CitySlash'] = row['CityName'][:ind]
    otherCities_df.at[i,'Metro'] = ten

In [11]:
cities = pd.read_csv('https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Data/scraped_datausa.csv', index_col=0)
cities['CitySlash'] = [i[:-4] for i in cities.City]

In [12]:
complex_city = otherCities_df[otherCities_df['Metro'] == 1]
complex_city = complex_city.drop(columns=['Metro','CityName']) # positional axis argument is not accepted by pandas 2.x
complex_city_list = list(complex_city.CitySlash)

In [13]:
for i, row in cities.iterrows():
    if row.CitySlash in complex_city_list:
        cities.at[i,'LIKE'] = 1
    else:
        cities.at[i,'LIKE'] = 0
LIKE_df = cities[cities['LIKE'] == 1]
UNLIKE_df = cities[cities['LIKE'] == 0]
UNLIKE_df = UNLIKE_df.drop(columns=["CitySlash","LIKE"])

Since we are trying to remove duplicated rows based on the city name only, and a lot of cities in our dataset have the same city name but are from different states, we will remove duplicates based on the name of the cities JUST for those cities that have similar names to those appended (the complex-named cities). Hence, we will be working with two datasets:

1. LIKE_df: Holds the cities and their duplicates from the dataframe based on name only. We will remove duplicates.
2. UNLIKE_df: Has all other cities, which we will not edit at all except for standardizing names!

In [14]:
LIKE_df = LIKE_df.drop(columns="LIKE")
dup_LIKE = pd.concat([LIKE_df,complex_city])
dup_LIKE = dup_LIKE.sort_values('City', ascending=False)
dup_LIKE['State'] = [i[i.find(", ")+2:] for i in dup_LIKE.City]

It seems like the main issue here is with complex city names that have multiple states in their name! After some analysis, we realized that the first state in the name belongs to the first city, which is the one we care about, so we will build the new name from those two parts!

In [15]:
for i, row in dup_LIKE.iterrows():
    dup_LIKE.at[i,'City'] = row['CitySlash']+", "+row['State'][0:2]
    
dup_LIKE = dup_LIKE.drop(columns=['CitySlash','State'])
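
To make the renaming rule concrete, here is a small standalone sketch of the same logic applied to a single name: take the text before the first hyphen as the city and the first two letters after the comma as the state. The function name `primary_city_label` is hypothetical. One caveat that follows from the rule itself: a hyphenated single-city name would also be cut at its first hyphen.

```python
def primary_city_label(name):
    """Reduce a multi-city name to 'FirstCity, ST' and flag whether it was composite."""
    # Split 'City-City-City, ST-ST' into the city part and the state part
    city_part, _, state_part = name.partition(", ")
    first_city = city_part.split("-")[0]
    first_state = state_part[:2]
    is_composite = "-" in city_part
    return f"{first_city}, {first_state}", is_composite

print(primary_city_label("New York-Newark-Jersey City, NY-NJ-PA"))
print(primary_city_label("Akron, OH"))
```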

In [16]:
dup_LIKE.sort_values('Population', inplace=True, ascending=False)
dup_LIKE = dup_LIKE.drop_duplicates('City', keep='first')
print(dup_LIKE.shape)

(152, 26)


In [17]:
combi_citi = pd.concat([dup_LIKE, UNLIKE_df])
combi_citi = combi_citi.reset_index(drop=True)
combi_citi.shape

(4069, 26)

In [18]:
#combi_citi.to_csv("PATH/scraped_datausa_2.csv")

## Appending Other Metrics
___

After analyzing the data, we realized that some important metrics were lacking. So, I decided to continue scraping, this time using <a href='https://www.numbeo.com'>Numbeo</a>.

In [19]:
citi_add_df = combi_citi
citi_add_df['CityName'] = [i[:-4] for i in citi_add_df.City]

Create URLs for all the cities and the websites containing the metrics wanted!

In [20]:
qol_urls = [[row['City'],"https://www.numbeo.com/quality-of-life/in/"+ row['CityName'].replace(" ","-")] for i, row in citi_add_df.iterrows()]
qol_urls_state = [[row['City'],"https://www.numbeo.com/quality-of-life/in/"+ row['City'][:-4].replace(" ","-")+"-"+row['City'][-2:]] for i, row in citi_add_df.iterrows()]
qol_all_urls = qol_urls + qol_urls_state
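
For a single city, the two URL forms built above look as follows. This is a sketch only: the helper name `numbeo_candidate_urls` is hypothetical, and it assumes every label ends in ", ST" as in the dataframe.

```python
NUMBEO_BASE = "https://www.numbeo.com/quality-of-life/in/"

def numbeo_candidate_urls(city_label):
    """Return the name-only and the name-plus-state Numbeo URL for 'City, ST'."""
    # The last four characters are ', ST'; the last two are the state abbreviation
    name, state = city_label[:-4], city_label[-2:]
    slug = name.replace(" ", "-")
    return [NUMBEO_BASE + slug, NUMBEO_BASE + slug + "-" + state]

print(numbeo_candidate_urls("Los Angeles, CA"))
```

Trying both forms matters because a name-only path does not distinguish between cities that share a name.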

In [21]:
quality_df = pd.read_html("https://www.numbeo.com/quality-of-life/in/Phoenix")[3]
quality_df.rename(columns = {0:'Metric',1:'Value',2:'Compare'}, inplace=True)
quality_df.set_index("Metric", inplace=True)

In [22]:
def quality_of_life(url, name):
    quality_df = pd.read_html(url)[3]
    quality_df.rename(columns = {0:'Metric',1:'Value',2:'Compare'}, inplace=True)
    quality_df.set_index("Metric", inplace=True)
    
    diction = {"City":name,
               "Purchasing Power": quality_df.loc['Purchasing Power Index','Value'],
              "Safety": quality_df.loc['Safety Index','Value'],
              "Health Care": quality_df.loc['Health Care Index','Value'],
              "Climate": quality_df.loc['Climate Index','Value'],
              "Cost of Living":quality_df.loc['Cost of Living Index','Value'],
              "Pollution": quality_df.loc['Pollution Index','Value'],
              "Purchasing Power Comp": quality_df.loc['Purchasing Power Index','Compare'],
              "Safety Comp": quality_df.loc['Safety Index','Compare'],
              "Health Care Comp": quality_df.loc['Health Care Index','Compare'],
              "Climate Comp": quality_df.loc['Climate Index','Compare'],
              "Cost of Living Comp":quality_df.loc['Cost of Living Index','Compare'],
              "Pollution Comp": quality_df.loc['Pollution Index','Compare']}
    return diction

In [23]:
def quality_of_life_secondary(url, name):
    quality_df = pd.read_html(url)[4]
    quality_df.rename(columns = {0:'Metric',1:'Value',2:'Compare'}, inplace=True)
    quality_df.set_index("Metric", inplace=True)
        
    diction = {"City":name,
               "Purchasing Power": quality_df.loc['Purchasing Power Index','Value'],
              "Safety": quality_df.loc['Safety Index','Value'],
              "Health Care": quality_df.loc['Health Care Index','Value'],
              "Climate": quality_df.loc['Climate Index','Value'],
              "Cost of Living":quality_df.loc['Cost of Living Index','Value'],
              "Pollution": quality_df.loc['Pollution Index','Value'],
              "Purchasing Power Comp": quality_df.loc['Purchasing Power Index','Compare'],
              "Safety Comp": quality_df.loc['Safety Index','Compare'],
              "Health Care Comp": quality_df.loc['Health Care Index','Compare'],
              "Climate Comp": quality_df.loc['Climate Index','Compare'],
              "Cost of Living Comp":quality_df.loc['Cost of Living Index','Compare'],
              "Pollution Comp": quality_df.loc['Pollution Index','Compare']}
    return diction

In [24]:
added_metrics = []
for url in qol_all_urls:
    try:
        data = quality_of_life(url[1], url[0])
        added_metrics.append(data)
    except:
        try:
            data = quality_of_life(url[1], url[0])
            added_metrics.append(data)
        except:
            try:
                data = quality_of_life_secondary(url[1], url[0])
                added_metrics.append(data)
            except:
                try:
                    data = quality_of_life_secondary(url[1], url[0])
                    added_metrics.append(data)
                except:
                    pass

new_metrics = pd.DataFrame(added_metrics)
print(new_metrics.shape)
new_metrics.head()

(684, 13)


Unnamed: 0,City,Purchasing Power,Safety,Health Care,Climate,Cost of Living,Pollution,Purchasing Power Comp,Safety Comp,Health Care Comp,Climate Comp,Cost of Living Comp,Pollution Comp
0,"Los Angeles, CA",105.43,53.53,62.62,95.5,78.52,66.34,High,Moderate,High,Very High,Moderate,High
1,"Dallas, TX",141.64,50.41,65.84,81.85,66.81,42.96,Very High,Moderate,High,Very High,Moderate,Moderate
2,"Washington, DC",114.55,42.45,71.76,81.62,85.18,39.24,Very High,Moderate,High,Very High,Moderate,Low
3,"Philadelphia, PA",88.21,39.16,70.06,77.98,78.77,50.45,Moderate,Low,High,High,Moderate,Moderate
4,"Riverside, CA",110.3,60.68,59.57,86.63,68.22,48.15,Very High,High,Moderate,Very High,Moderate,Moderate


In [25]:
# Convert the index columns to numbers; values that cannot be parsed become NaN
numeric_cols = ['Purchasing Power', 'Safety', 'Health Care', 'Climate', 'Cost of Living', 'Pollution']
new_metrics[numeric_cols] = new_metrics[numeric_cols].apply(pd.to_numeric, errors='coerce')

Wow... only 684 of the cities of interest had information on Numbeo. However, it's likely that these include all the large cities, which are the cities of interest for our recommender system! 

---

Let's export this dataset; we will analyze it and possibly combine it with the rest of the data in our Exploratory Data Analysis notebook!

In [26]:
# new_metrics.to_csv("PATH/append_gen_data.csv")