<div style="font-family: 'Courier', Courier New, monospace; font-weight: 400; text-align: center;">
    <h1>Hack High School Data Camp + 49ers</h1>
    <p style="font-size: 18px">This Jupyter notebook contains part of the curriculum of a data camp run by Hack High School in partnership with the San Francisco 49ers. The goal of the project was to teach high schoolers Python, web scraping, and data visualization.</p></div>

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">We are going to use the data on the <a href="https://www.49ers.com/team/players-roster/">49ers website</a> to gather insights about the team and create interesting visualizations!</p>

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">First, we need a quick refresher on how the internet works and what happens when you navigate to a URL. The internet is a giant network of computers connected to each other.</p>


<img style="max-height: 200px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Client-server-model.svg/1280px-Client-server-model.svg.png">

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">In order for computers to communicate, a set of rules (protocols) needs to be in place. Each computer needs an IP address, which works as a unique ID. The internet also relies on an application structure called the client-server model. In this context, clients are applications that request services from servers. The domain name in a website's URL (such as www.example.com) is in effect a "nickname" for a server's IP address. So when you go to http://www.example.com, your browser is a client application making an HTTP request to a web server. If everything goes as expected and the request is accepted, the server sends back an HTTP response with status code 200, and your browser displays the contents of the page.</p>
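
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">To see a request and a response in action without leaving your computer, here is a minimal sketch that uses only Python's standard library. It starts a tiny web server on your own machine, then acts as the client and asks it for a page. The <code>HelloHandler</code> class and the page contents are made up for this illustration; real web servers are far more elaborate.</p>

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """A made-up server that answers every GET request with the same page."""

    def do_GET(self):
        body = b"<h1>Hello from the server!</h1>"
        self.send_response(200)  # 200 means "OK, here is what you asked for"
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the notebook output quiet


# The server: port 0 lets the operating system pick any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client: send an HTTP GET request and read the response
connection = HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
response = connection.getresponse()
status = response.status
page = response.read().decode()
connection.close()

# Stop the server now that we are done
server.shutdown()
server.server_close()

print(status)
print(page)
```

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Here 127.0.0.1 is the IP address every computer uses to refer to itself, so the client and the server are both running on your machine.</p>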

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">This was a very high-level overview, and we suggest you watch this <a href="http://www.youtube.com/watch?v=Dxcc6ycZ73M">video</a> series by Code.org.</p>

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Let's get started by importing the Python library that will allow us to make requests to web servers.</p>

In [None]:
import requests

In [None]:
website_data = requests.get('https://www.49ers.com/team/players-roster/')
website_data.text

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Good job! Now we have a huge string with the data from the website. However, it looks messy, and we only want the table containing the players' information.</p>
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Web pages are written in a markup language called HTML. HTML syntax uses tags, written between angle brackets (for example &lt;p&gt; to open a paragraph and &lt;/p&gt; to close it), to specify the content to be displayed. There are many different tags, such as the ones for images and paragraphs. For our exercise, we are looking for the table tag.</p>
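
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">As a small illustration, the sketch below lists the tags found in a tiny, made-up page. It uses Python's built-in <code>html.parser</code> module; the <code>sample_page</code> string and the <code>TagCollector</code> class are invented for this example and are not part of the 49ers website.</p>

```python
from html.parser import HTMLParser

# A tiny made-up web page with a heading, a paragraph and a table
sample_page = """
<h1>Roster</h1>
<p>Two example players.</p>
<table>
  <tr><th>Player</th><th>Pos</th></tr>
  <tr><td>Player A</td><td>WR</td></tr>
</table>
"""


class TagCollector(HTMLParser):
    """Remembers the name of every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(sample_page)
print(collector.tags)
```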

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Go to the website, right-click on the page, and select Inspect. You should be able to see all the different elements and their respective tags.</p>

<img src="images/49ers-website.png" style="height: 50%">

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Now we're going to import some additional libraries to extract the table.</p>

In [28]:
import pandas as pd
from bs4 import BeautifulSoup

In [29]:
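from io import StringIO

soup = BeautifulSoup(website_data.text, 'lxml')
table = soup.find_all("table")[0]

# pd.read_html returns a list of dataframes, one per table it finds.
# We give it a single table, so index [0] stores that dataframe in players_df.
# Recent pandas versions expect the HTML wrapped in StringIO rather than a bare string.
players_df = pd.read_html(StringIO(str(table)))[0]
players_df.head(5)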

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
0,Kapron Lewis-Moore,98.0,DL,6-4,315,29,4,Notre Dame
1,Jordan Smallwood,,WR,6-2,225,24,R,Oklahoma
2,Marcus Lucas,83.0,TE,6-4,250,26,1,Missouri
3,Tyree Robinson,49.0,DB,6-2,215,24,R,Oregon
4,Colin Holba,40.0,LS,6-4,248,24,2,Louisville


<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Well done! With a little help from BeautifulSoup, we were able to extract the table. We also used a pandas dataframe to store our tabular data, which will be very helpful in the following exercises since we are going to perform calculations.</p>
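
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">BeautifulSoup and pandas did the heavy lifting for us. To get a feel for what "extracting a table" involves, here is a rough sketch that does a simplified version of the same job with Python's built-in <code>html.parser</code> instead. The <code>TableRows</code> class and the sample table are made up for illustration, and this is not how BeautifulSoup or pandas are actually implemented.</p>

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # a new row begins
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")    # a new, still empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up table with the same kind of columns as the roster
sample_table = """
<table>
  <tr><th>Player</th><th>Pos</th><th>HT</th></tr>
  <tr><td>Player A</td><td>WR</td><td>6-2</td></tr>
  <tr><td>Player B</td><td>TE</td><td>6-4</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_table)
header, *rows = parser.rows
print(header)
print(rows)
```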

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Now let's start creating some interesting visualizations with our data.</p>



In [30]:
!pip install geopy



In [31]:
#Importing packages and libraries to get coordinates from a location
import geopandas 
import geopy
from geopy.geocoders import Nominatim, GoogleV3

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">We're going to use geopy to get the coordinates for each entry in the "College" column.</p>

In [56]:
def get_latitude(x):
  # geocode returns None when it cannot find a place, so guard against that
  return x.latitude if x is not None else None

def get_longitude(x):
  return x.longitude if x is not None else None

# Nominatim's usage policy (https://operations.osmfoundation.org/policies/nominatim/)
# requires a custom user_agent; the default one triggers a warning in geopy 1.x
# and an exception in geopy 2.0. "49ers-data-camp" is a placeholder name.
# A longer timeout makes "Service timed out" errors less likely.
geolocator = Nominatim(user_agent="49ers-data-camp", timeout=10)

geolocate_column = players_df['College'].apply(geolocator.geocode)
players_df['Latitude'] = geolocate_column.apply(get_latitude)
players_df['Longitude'] = geolocate_column.apply(get_longitude)

In [None]:
players_df.head(5)
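
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Free geocoding services such as Nominatim ask users to send requests slowly (Nominatim's usage policy allows at most one request per second), and several players may share the same college. The sketch below shows one possible way to be polite: look up each distinct name only once and pause between requests. The <code>geocode_unique</code> helper and the <code>fake_geocode</code> stand-in, including its coordinates, are made up for this example so that it runs without an internet connection.</p>

```python
import time
from collections import namedtuple


def geocode_unique(names, geocode, delay_seconds=1.0):
    """Look up each distinct name once; return {name: (lat, lon) or None}."""
    coordinates = {}
    for name in names:
        if name in coordinates:
            continue  # already looked up, so skip the repeated request
        if coordinates:
            time.sleep(delay_seconds)  # pause between two real requests
        location = geocode(name)
        if location is None:
            coordinates[name] = None  # the service could not find this place
        else:
            coordinates[name] = (location.latitude, location.longitude)
    return coordinates


# A stand-in for a real geocoder, with made-up coordinates
Location = namedtuple("Location", ["latitude", "longitude"])
calls = []

def fake_geocode(name):
    calls.append(name)
    known = {"Oregon": Location(44.0, -120.5)}
    return known.get(name)


result = geocode_unique(["Oregon", "Nowhere U", "Oregon"], fake_geocode, delay_seconds=0)
print(result)
print(len(calls), "requests were made for 3 names")
```

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">With a real geocoder, the same helper could be called as <code>geocode_unique(players_df['College'], geolocator.geocode)</code>.</p>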

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Now that we have the coordinates, we are going to use plotly to show on a map where each player went to college.</p>

In [None]:
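# Importing plotly (offline mode draws the figure inside the notebook).
# The unused plotly.plotly import was dropped: plotly 4 and later no longer ship that module.
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()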

In [None]:
data = [dict(
    type = 'scattergeo',
    locationmode = 'USA-states',
    lon = players_df['Longitude'],
    lat = players_df['Latitude'],
    text = players_df['Player'] + ' - ' + players_df['College'],
    mode = 'markers'
    )]

layout = dict(
    title = "San Francisco 49ers - Players' Colleges",
    geo = dict(
      scope='usa',
      projection=dict( type='albers usa' ),
      showland = True,
      landcolor = "rgb(250, 250, 250)",
      subunitcolor = "rgb(217, 217, 217)",
      countrycolor = "rgb(217, 217, 217)",
      countrywidth = 0.5,
      subunitwidth = 0.5
      )
)

fig = go.Figure(layout=layout, data=data)
plotly.offline.iplot(fig, validate=False, filename='49ers-map')

<center><p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">You just created an interactive map, good job!</p>(For a static version of the plot, check out the image 49ers-plot.png.)</center>

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Now we are moving to a more advanced exercise. We are going to calculate each player's BMI (Body Mass Index).</p>
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Since we have the heights and weights in feet/inches and pounds, we're going to use the following formula:</p>

<img src="images/bmi-formula.jpg">
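
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Written out, and assuming the image shows the standard formula for pounds and inches (it matches the calculation used later in this notebook), the formula is:</p>

$$\text{BMI} = 703 \times \frac{\text{weight (lb)}}{\text{height (in)}^2}$$

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">For example, a player who is 76 inches tall and weighs 315 pounds has a BMI of 703 × 315 / 76² ≈ 38.34.</p>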

<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">Let's take a look at the dataframe again.</p>

In [None]:
players_df.head(5)

In [61]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 9 columns):
Player     72 non-null object
#          71 non-null float64
Pos        72 non-null object
HT         72 non-null int64
WT         72 non-null int64
Age        72 non-null int64
Exp        72 non-null object
College    72 non-null object
BMI        72 non-null float64
dtypes: float64(2), int64(3), object(4)
memory usage: 5.1+ KB


<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">We can see that the values in the weight column ("WT") are numbers (int64). The height column ("HT"), however, is scraped as text such as "6-4" (feet-inches), which pandas stores as objects. In order to perform calculations, we need to convert the height values to numbers. In addition, we need to convert the feet to inches and add the remaining inches (e.g., 5-5 = 5 * 12 + 5 = 65 inches).</p>
<p style="font-family: 'Courier', Courier New, monospace; font-size: 15px">So let's get started!</p>

In [53]:
players_df.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
0,Kapron Lewis-Moore,98.0,DL,76,315,29,4,Notre Dame
1,Jordan Smallwood,,WR,74,225,24,R,Oklahoma
2,Marcus Lucas,83.0,TE,76,250,26,1,Missouri
3,Tyree Robinson,49.0,DB,74,215,24,R,Oregon
4,Colin Holba,40.0,LS,76,248,24,2,Louisville


In [51]:
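def convert_height(x):
    if not isinstance(x, str):
        return x  # already converted to inches, so re-running this cell is safe
    feet, inches = x.split('-')  # "6-4" -> "6" and "4"; "5-11" -> "5" and "11"
    return int(feet) * 12 + int(inches)

In [52]:
players_df["HT"] = players_df["HT"].apply(convert_height)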

In [44]:
players_df.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
0,Kapron Lewis-Moore,98.0,DL,76,315,29,4,Notre Dame
1,Jordan Smallwood,,WR,74,225,24,R,Oklahoma
2,Marcus Lucas,83.0,TE,76,250,26,1,Missouri
3,Tyree Robinson,49.0,DB,74,215,24,R,Oregon
4,Colin Holba,40.0,LS,76,248,24,2,Louisville


In [67]:
def calculate_bmi(height, weight):
    # height in inches, weight in pounds; 703 is the conversion factor for these units
    bmi = (weight / height**2) * 703
    return round(bmi, 2)

In [68]:
players_df['BMI'] = players_df.apply(lambda row: calculate_bmi(row['HT'], row['WT']), axis=1)

In [69]:
players_df.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College,BMI
0,Kapron Lewis-Moore,98.0,DL,76,315,29,4,Notre Dame,38.34
1,Jordan Smallwood,,WR,74,225,24,R,Oklahoma,28.89
2,Marcus Lucas,83.0,TE,76,250,26,1,Missouri,30.43
3,Tyree Robinson,49.0,DB,74,215,24,R,Oregon,27.6
4,Colin Holba,40.0,LS,76,248,24,2,Louisville,30.18
