___

<a href='https://github.com/eliasmelul/'> <img src='https://s3.us-east-2.amazonaws.com/wordontheamazon.com/NoMargin_NewLogo.png' style='width: 15em;' align='right' /></a>
# Finding my Schitt's Creek
#### Data Collection: Weather and City Names
___
<h3 align="right">by Elias Melul, Data Scientist </h3> 

___



## Collecting Weather Data and City Names

---
**GOOD NEWS!** If you want to download this data, you don't need to rerun the notebook and scrape the web. I've made the data available for you **<a href='https://github.com/eliasmelul/finding_schitts/blob/master/Data/final_weather_data.csv'>here</a>**.

Want direct access? **<a href='https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Data/final_weather_data.csv'>Click Here</a>**

---
This data includes the average monthly temperature of about 5,800 cities and the monthly _swing_: the difference between the average maximum temperature (F) and the average minimum temperature (F) for each month. It also includes the average monthly precipitation for each city.

----


<h2 id = "BusinessUnderstanding">Business Understanding</h2>

The question behind this capstone project is quite simple, and one that many people have faced.
_______________________________________________________________________________________________________________________

Most of us have heard of New York City, Boston, San Francisco, Chicago, Miami... all very different cities with a lot to offer. However, there are many other incredible cities in the United States, cities that we might not know much about and that we never learn about until someone introduces us to them!

As an international student in the US, I constantly wonder whether my desire to go to one of the big cities (NYC, Boston, San Francisco, etc.) is justified, and whether there are other places in the US that I simply don't know about. I go to Duke University, and I really like Raleigh-Durham. Had I not gone there, I would never have known!

-------------------

So I am going to classify the cities based on an amalgam of features to see which ones are statistically most similar to each other. I will also build a recommender system that takes your favorite cities (and least favorites!) along with the ratings you give them, and returns other cities with similar characteristics. Feel free to try it out!

<h2 id = "DataCollection">Data Collection</h2>

In order to create a competent recommender system, I need information about cities in the USA. We will then analyze this data, select the features we want to use, and begin building!

----
**Which cities will I include?**

The cities to be included are those listed on <a href='https://www.usclimatedata.com/climate/united-states/us'>this</a> website.


**_States:_** 

<img src='https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Images/USA_States_Table_WSite.PNG' style='width: 30em;' align=auto />

As you can observe in the image, the website contains a table that references all the states in the US. 

<img src='https://s3.us-east-2.amazonaws.com/www.findingmyschittscreek.com/Images/States_Cities_Table_WSite.PNG' style='width: 30em;' align=auto />

Once you click on one of the states, like Alabama in this example, a table of all the cities in that state appears.

Therefore, we will use this website and BeautifulSoup to get all the cities in each state, along with the references (URLs) to each city's page, so we can loop through these and scrape all the weather data.
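To make the idea concrete, here is a minimal offline sketch of the link extraction. It is not the code this notebook runs: it swaps BeautifulSoup for the standard library's `html.parser` so that it works without a network connection or extra installs, and the `sample_html` markup, the `LinkCollector` class and the "Example State" entry are made up for illustration. Only the `stretched-link` class name and the Alabama URL come from the notebook itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.usclimatedata.com"


class LinkCollector(HTMLParser):
    """Collect (text, absolute URL) pairs from <a class="stretched-link"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "stretched-link" in classes and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            # urljoin turns the relative href into an absolute URL
            self.links.append((name, urljoin(self.base_url, self._href)))
            self._href = None


# Hypothetical markup; the real page layout may differ.
sample_html = """
<table>
  <tr><td><a class="stretched-link" href="/climate/alabama/united-states/3170">Alabama</a></td></tr>
  <tr><td><a class="other" href="/about">About</a></td></tr>
  <tr><td><a class="stretched-link" href="/climate/example-state/united-states/0000">Example State</a></td></tr>
</table>
"""

collector = LinkCollector(BASE_URL)
collector.feed(sample_html)
links = collector.links
for name, url in links:
    print(name, url)
```

The same two pieces of information, the link text and the absolute URL, are what the scraping functions in this notebook store for every state and city.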


----
**What data will be collected and how?**

There is a CRAZY amount of data available on the web about all cities, but it can be very disparate and spread across numerous websites. Lucky me, I found [this](https://datausa.io/) website that contains a lot of the information I could use (thank you Deloitte and Datawheel!). I will then combine this information with weather information and FourSquare information about venues to complete the dataframe and begin modeling!

**Some of the variables scraped are:**

* Population and Population Change (Year to Year)
* Poverty Rate
* Median Age
* Median Household Income and Median Household Income Change (Year to Year)
* Number of Employees and Number of Employees Change (Year to Year)
* Median Property Value and Median Property Value Change (Year to Year)
* Average Male and Female Salary, and a ratio of Average Male to Female Salary
* Gini coefficient in 2017 and 2018, as well as its change (Year to Year)
* Ratio of Patients to Clinicians (county-wise)
* Foreign-born population percentage
* Citizen population percentage
* Total degrees awarded in 2018 (higher education)
* Male to Female ratio of awarded degrees
* Number of degrees per capita
* Number of households in city
* Population per household (people per household)
* Homeownership Percentage (Rent vs Own)
* Average Commute Time (minutes)



#### Import Libraries

---
The libraries imported are not all used in this notebook. However, to be able to follow the whole project, please make sure you have these installed!

If any of them are missing, use pip or conda to install them on your computer.
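If you are not sure what is already installed, a quick check along these lines may help. This is only a sketch: the `REQUIRED` list and the `missing_packages` helper are my own names, and the list is assembled from the imports in the next cell plus `lxml`, which the scraping functions pass to BeautifulSoup as the parser.

```python
from importlib.util import find_spec

# Top-level import names used in this notebook (assumed list, adjust as needed)
REQUIRED = ["pandas", "numpy", "requests", "bs4", "matplotlib",
            "seaborn", "folium", "lxml", "IPython"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print(missing_packages(REQUIRED))
```

An empty list means everything is importable. Note that the import name and the package name can differ: `bs4` is installed as `beautifulsoup4`.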

In [1]:
import pandas as pd # import pandas for dataframes
import numpy as np
import requests
from bs4 import BeautifulSoup
import locale
from datetime import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns
import json
import folium
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0


import matplotlib.cm as cm
import matplotlib.colors as colors

from IPython.display import HTML, Image, display

pd.options.display.max_columns = None
pd.options.display.max_rows=None


%matplotlib inline

<h3 id ="GettingList">Getting List of Cities</h3>

---

To get the list of cities, we will be using <a href='https://www.usclimatedata.com/climate/united-states/us'>this</a> weather website. This will allow us to get the list of states and the names of the cities in every state, along with the weather data for each city.
1. Get the name of each state along with the link to the page that lists all the cities in that state.
2. Get the name of each city along with the link to the page that contains all the weather information for that city.
3. Scrape the desired weather data for each city in every state, as published on this website:
    - Average monthly temperature
    - Swing from average high to average low temperature for every month.
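Both quantities are derived from the monthly average high and low temperatures: the average is the midpoint of the two, and the swing is their difference. The following is a small sketch of that arithmetic; `monthly_features` is a hypothetical helper name and the input values are illustrative, not scraped.

```python
def monthly_features(highs, lows):
    """Return (average temperatures, swings) from monthly average highs and lows (F)."""
    if len(highs) != len(lows):
        raise ValueError("highs and lows must have the same length")
    avg = [(h + l) / 2 for h, l in zip(highs, lows)]
    swing = [h - l for h, l in zip(highs, lows)]
    return avg, swing


# Illustrative highs and lows for two months
avg, swing = monthly_features([50.0, 54.0], [29.0, 33.0])
print(avg)    # [39.5, 43.5]
print(swing)  # [21.0, 21.0]
```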

#### Generating the URLs

In [2]:
# Build a dataframe with the name of each state and the URL of its list of cities
def get_state_names(url):
    states = []
    req = requests.get(url)
    soup = BeautifulSoup(req.content, features='lxml')
    # Each state is rendered as an <a class="stretched-link"> element
    for link in soup.find_all("a",{"class":"stretched-link"}):
        newurl = 'https://www.usclimatedata.com'+link.get('href')
        states.append({'State':link.get_text(),'url':newurl})
    return pd.DataFrame(states)
states_names_list = get_state_names('https://www.usclimatedata.com/')

The District of Columbia is oddly named in our dataframe, so let's fix it.

In [3]:
states_names_list.at[8,'State'] = "District of Columbia"
states_names_list[0:5]

Unnamed: 0,State,url
0,Alabama,https://www.usclimatedata.com/climate/alabama/...
1,Alaska,https://www.usclimatedata.com/climate/alaska/u...
2,Arizona,https://www.usclimatedata.com/climate/arizona/...
3,Arkansas,https://www.usclimatedata.com/climate/arkansas...
4,California,https://www.usclimatedata.com/climate/californ...


In [4]:
def get_city_names(url):
    cities_st = []
    req = requests.get(url)
    soup = BeautifulSoup(req.content, features='lxml')
    name_state = soup.find("p",{"class":"selection_title"})
    name_state = name_state.get_text()[:-24]
    table1 = soup.find_all("a", {"class":"stretched-link"})
    for i in range(0,len(table1)):
        name_city = table1[i].get_text()
        newurl = 'https://www.usclimatedata.com'+table1[i].get('href')
        tempdic = {'State':name_state,'City':name_city,'url_city':newurl}
        cities_st.append(tempdic)
    cities_df = pd.DataFrame(cities_st)
    return cities_df
excities = get_city_names('https://www.usclimatedata.com/climate/alabama/united-states/3170')
excities.head()

Unnamed: 0,State,City,url_city
0,Alabama,Addison,https://www.usclimatedata.com/climate/addison/...
1,Alabama,Alabaster,https://www.usclimatedata.com/climate/alabaste...
2,Alabama,Alexander City,https://www.usclimatedata.com/climate/alexande...
3,Alabama,Aliceville,https://www.usclimatedata.com/climate/alicevil...
4,Alabama,Andalusia,https://www.usclimatedata.com/climate/andalusi...


In [5]:
state_city_urls = []
for i, row in states_names_list.iterrows():
    random_state_city = get_city_names(row.url)
    state_city_urls.append(random_state_city)


In [6]:
# DataFrame.append was removed in pandas 2.0; concatenate the per-state dataframes instead
state_city_df = pd.concat(state_city_urls, ignore_index=True)
state_city_df.head()

Unnamed: 0,State,City,url_city
0,Alabama,Addison,https://www.usclimatedata.com/climate/addison/...
1,Alabama,Alabaster,https://www.usclimatedata.com/climate/alabaste...
2,Alabama,Alexander City,https://www.usclimatedata.com/climate/alexande...
3,Alabama,Aliceville,https://www.usclimatedata.com/climate/alicevil...
4,Alabama,Andalusia,https://www.usclimatedata.com/climate/andalusi...


In [7]:
print(state_city_df.shape)
state_city_df.head()

(5850, 3)


Unnamed: 0,State,City,url_city
0,Alabama,Addison,https://www.usclimatedata.com/climate/addison/...
1,Alabama,Alabaster,https://www.usclimatedata.com/climate/alabaste...
2,Alabama,Alexander City,https://www.usclimatedata.com/climate/alexande...
3,Alabama,Aliceville,https://www.usclimatedata.com/climate/alicevil...
4,Alabama,Andalusia,https://www.usclimatedata.com/climate/andalusi...


---

Now that we have a dataframe with all the cities and the URLs of their weather pages, let's begin scraping!

In [23]:
#Now, let's get the average highs, lows and precipitation in each city
def get_weather(state, city, url):
    req = requests.get(url)
    soup = BeautifulSoup(req.content, features='lxml')
    temp_high = soup.find_all("td",{"class":"high text-right"})
    temp_high = [i.get_text() for i in temp_high]
    temp_high = temp_high[0:12]
    temp_low = soup.find_all("td",{"class":"low text-right"})
    temp_low = [i.get_text() for i in temp_low]
    temp_low = temp_low[0:12]
    precip = soup.find_all("td",{"class":"text-right"})
    precip = [i.get_text() for i in precip]
    precip = precip[12:18]+precip[30:36]
    swing = [float(temp_high[i])-float(temp_low[i]) for i in range(0,len(temp_high))]
    avg_day = [(float(temp_high[i])+float(temp_low[i]))/2 for i in range(0,len(temp_high))]
    ava = avg_day+swing+precip
    # Row labels: 'Temp Jan'...'Temp Dec', then 'Swing ...', then 'Precip ...'
    months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    labels = [kind+' '+month for kind in ('Temp','Swing','Precip') for month in months]
    df = pd.DataFrame(ava, index = labels)
    df = df.transpose()
    df['State'] = state
    df['City'] = city
    return df
get_weather('Alabama','Addison','https://www.usclimatedata.com/climate/addison/alabama/united-states/usal0586')

Unnamed: 0,Temp Jan,Temp Feb,Temp Mar,Temp Apr,Temp May,Temp Jun,Temp Jul,Temp Aug,Temp Sep,Temp Oct,Temp Nov,Temp Dec,Swing Jan,Swing Feb,Swing Mar,Swing Apr,Swing May,Swing Jun,Swing Jul,Swing Aug,Swing Sep,Swing Oct,Swing Nov,Swing Dec,Precip Jan,Precip Feb,Precip Mar,Precip Apr,Precip May,Precip Jun,Precip Jul,Precip Aug,Precip Sep,Precip Oct,Precip Nov,Precip Dec,State,City
0,39.5,43.5,51.5,59.5,67.5,74.5,78.5,78,71.5,60.5,51.5,42,21,21,23,23,23,21,21,22,23,25,23,20,5.03,5.29,5.25,4.78,5.48,4.83,4.83,3.75,4.27,3.77,5.05,5.91,Alabama,Addison


**The next step takes a while, depending on your internet connection! Expect it to run for around 30 minutes!**
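The shapes printed in this notebook (5,850 city URLs versus 5,808 rows of weather data) suggest that a few dozen pages cannot be parsed and are skipped. If you want to know which ones, a loop along these lines keeps the failures instead of discarding them. This is a sketch under assumptions: `scrape_all` and `fake_fetch` are hypothetical names, the `delay` parameter is my own addition, and `fake_fetch` stands in for `get_weather` so that the example runs offline.

```python
import time


def scrape_all(rows, fetch, delay=0.0):
    """Call fetch(state, city, url) for each row, keeping results and failures apart."""
    results, failures = [], []
    for state, city, url in rows:
        try:
            results.append(fetch(state, city, url))
        except Exception as exc:  # record the problem instead of silently skipping it
            failures.append((state, city, url, repr(exc)))
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results, failures


# Stand-in for get_weather; it fails on one made-up URL.
def fake_fetch(state, city, url):
    if url.endswith("/broken"):
        raise ValueError("no temperature table found")
    return {"State": state, "City": city}


rows = [
    ("Alabama", "Addison", "https://example.com/addison"),
    ("Alabama", "Nowhere", "https://example.com/broken"),
]
results, failures = scrape_all(rows, fake_fetch)
print(len(results), "scraped,", len(failures), "failed")
```

With the real function, `rows` could be built from the `State`, `City` and `url_city` columns of `state_city_df`, and the `failures` list can be inspected or retried afterwards.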

In [26]:
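# Collect one dataframe per city, then concatenate (DataFrame.append was removed in pandas 2.0)
weather_frames = []
for i, row in state_city_df.iterrows():
    try:
        weather_frames.append(get_weather(row.State, row.City, row.url_city))
    except Exception:
        # Skip cities whose page is unavailable or lacks a complete weather table
        continue
all_weather_df = pd.concat(weather_frames, ignore_index=True)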

In [27]:
print(all_weather_df.shape)
all_weather_df.head()

(5808, 38)


Unnamed: 0,Temp Jan,Temp Feb,Temp Mar,Temp Apr,Temp May,Temp Jun,Temp Jul,Temp Aug,Temp Sep,Temp Oct,Temp Nov,Temp Dec,Swing Jan,Swing Feb,Swing Mar,Swing Apr,Swing May,Swing Jun,Swing Jul,Swing Aug,Swing Sep,Swing Oct,Swing Nov,Swing Dec,Precip Jan,Precip Feb,Precip Mar,Precip Apr,Precip May,Precip Jun,Precip Jul,Precip Aug,Precip Sep,Precip Oct,Precip Nov,Precip Dec,State,City
0,39.5,43.5,51.5,59.5,67.5,74.5,78.5,78.0,71.5,60.5,51.5,42.0,21,21,23,23,23,21,21,22,23,25,23,20,5.03,5.29,5.25,4.78,5.48,4.83,4.83,3.75,4.27,3.77,5.05,5.91,Alabama,Addison
1,44.0,48.5,55.5,62.5,70.5,77.5,81.0,80.0,74.5,64.0,55.0,46.5,20,21,23,23,21,19,20,20,21,22,22,19,5.33,5.75,4.48,3.89,4.48,4.63,4.76,4.76,3.83,3.02,4.95,4.8,Alabama,Alabaster
2,43.5,47.0,54.5,61.5,69.5,77.0,80.0,79.0,73.5,63.0,54.0,45.5,23,24,27,27,25,22,22,22,23,26,26,23,5.21,5.35,5.49,4.11,4.33,4.45,5.31,4.5,4.1,3.08,4.79,4.9,Alabama,Alexander City
3,43.5,47.0,55.5,63.0,70.5,78.0,81.0,80.5,74.5,64.0,54.5,45.5,25,26,27,28,27,24,22,23,25,28,27,25,5.24,5.59,5.27,4.86,4.6,4.88,4.99,3.57,3.76,3.97,4.84,4.6,Alabama,Aliceville
4,46.5,50.0,56.5,62.5,70.5,77.5,79.5,79.5,75.0,65.0,56.0,49.0,29,30,31,31,29,25,25,23,26,30,32,28,5.18,5.3,6.35,4.17,3.89,5.3,6.16,5.65,4.62,3.71,4.73,4.91,Alabama,Andalusia


**State abbreviations**

We are going to add each state's abbreviation for future analytical purposes. The reason will become clear in the Data Collection - DATAUSA section.

---
To add the abbreviations, we need a list of abbreviations and the full names of their respective states. Luckily, Wikipedia has a page listing all the US state abbreviations! We will first import them, and then add the appropriate abbreviation to each city in our dataframe.
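Conceptually, this step is a lookup from state name to abbreviation, followed by building a "City, ST" label. Here is a minimal sketch with a tiny hand-written mapping; the dictionary, the `city_state_label` helper and the "Somewhere, Atlantis" example are illustrative assumptions, while the notebook itself reads the full table from Wikipedia.

```python
# A small hand-written subset of the state-to-abbreviation mapping
STATE_ABBREVIATIONS = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ"}


def city_state_label(city, state, abbreviations):
    """Return 'City, ST', or None when the state name has no known abbreviation."""
    abb = abbreviations.get(state)
    return f"{city}, {abb}" if abb else None


print(city_state_label("Addison", "Alabama", STATE_ABBREVIATIONS))     # Addison, AL
print(city_state_label("Somewhere", "Atlantis", STATE_ABBREVIATIONS))  # None
```

Keep in mind that a pandas `merge` uses an inner join by default, so rows whose state name has no match in the abbreviation table are dropped rather than labeled `None`.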

In [28]:
abb_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations', skiprows=11)[0]
abb_df = abb_df[['United States of America', 'Unnamed: 5']]
abb_df = abb_df.rename(columns={'United States of America':'State','Unnamed: 5':'State_Abb'})

In [29]:
# Add the state abbreviations and a "City, ST" label for further scraping
weatherCity_df = all_weather_df.merge(abb_df, on='State')
weatherCity_df['CityState'] = weatherCity_df['City']+", "+weatherCity_df['State_Abb']

# Move the State, City, State_Abb and CityState columns to the front
cols = weatherCity_df.columns.tolist()
cols = cols[-4:]+cols[:-4]
weatherCity_df = weatherCity_df[cols]


In [30]:
weatherCity_df.head()

Unnamed: 0,State,City,State_Abb,CityState,Temp Jan,Temp Feb,Temp Mar,Temp Apr,Temp May,Temp Jun,Temp Jul,Temp Aug,Temp Sep,Temp Oct,Temp Nov,Temp Dec,Swing Jan,Swing Feb,Swing Mar,Swing Apr,Swing May,Swing Jun,Swing Jul,Swing Aug,Swing Sep,Swing Oct,Swing Nov,Swing Dec,Precip Jan,Precip Feb,Precip Mar,Precip Apr,Precip May,Precip Jun,Precip Jul,Precip Aug,Precip Sep,Precip Oct,Precip Nov,Precip Dec
0,Alabama,Addison,AL,"Addison, AL",39.5,43.5,51.5,59.5,67.5,74.5,78.5,78.0,71.5,60.5,51.5,42.0,21,21,23,23,23,21,21,22,23,25,23,20,5.03,5.29,5.25,4.78,5.48,4.83,4.83,3.75,4.27,3.77,5.05,5.91
1,Alabama,Alabaster,AL,"Alabaster, AL",44.0,48.5,55.5,62.5,70.5,77.5,81.0,80.0,74.5,64.0,55.0,46.5,20,21,23,23,21,19,20,20,21,22,22,19,5.33,5.75,4.48,3.89,4.48,4.63,4.76,4.76,3.83,3.02,4.95,4.8
2,Alabama,Alexander City,AL,"Alexander City, AL",43.5,47.0,54.5,61.5,69.5,77.0,80.0,79.0,73.5,63.0,54.0,45.5,23,24,27,27,25,22,22,22,23,26,26,23,5.21,5.35,5.49,4.11,4.33,4.45,5.31,4.5,4.1,3.08,4.79,4.9
3,Alabama,Aliceville,AL,"Aliceville, AL",43.5,47.0,55.5,63.0,70.5,78.0,81.0,80.5,74.5,64.0,54.5,45.5,25,26,27,28,27,24,22,23,25,28,27,25,5.24,5.59,5.27,4.86,4.6,4.88,4.99,3.57,3.76,3.97,4.84,4.6
4,Alabama,Andalusia,AL,"Andalusia, AL",46.5,50.0,56.5,62.5,70.5,77.5,79.5,79.5,75.0,65.0,56.0,49.0,29,30,31,31,29,25,25,23,26,30,32,28,5.18,5.3,6.35,4.17,3.89,5.3,6.16,5.65,4.62,3.71,4.73,4.91


In [31]:
weatherCity_df.to_csv("YOUR_PATH/final_weather_data.csv")

There it goes! We've scraped the weather information and structured it in our desired format. Now, let's go to the next section to scrape the general information for each city.