# Project 2: Regression

## Backstory:

Using data we scrape from the web, build linear regression models that let us learn about movies, sports, or other categories.

In [1]:
from bs4 import BeautifulSoup
# import requests
import time, os
import numpy as np
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

from fake_useragent import UserAgent

ua = UserAgent()
user_agent = {'User-agent': ua.random}
print(user_agent)

url = "https://www.atptour.com/en/rankings/singles"

driver = webdriver.Chrome(chromedriver)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

{'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36'}


### Overview rankings table scrape:

In [2]:
ranks=[]
for element in soup.find_all('td', class_='rank-cell'):
    ranks.append(int(element.text.strip()))

In [3]:
players=[]
for element in soup.find_all('td', class_='player-cell'):
    players.append(element.text.strip())  # strip() drops surrounding newlines and spaces in one call

In [4]:
countries=[]
for element in soup.find_all('td', class_='country-cell'):
    countries.append(element.find('img').get('alt'))

In [5]:
ages=[]
for element in soup.find_all('td', class_='age-cell'):
    ages.append(int(element.text.strip()))

In [6]:
points=[]
for element in soup.find_all('td', class_='points-cell'):
    points.append(element.text.strip())


In [7]:
tourns = []
for element in soup.find_all('td', class_='tourn-cell'):
    tourns.append(element.text.strip())
# print(tourns)

In [8]:
player_urls = []
for element in soup.find_all('td', class_='player-cell'):
    player_urls.append(element.find('a').get('href'))
# print(player_urls)
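
The hrefs collected above are relative links to each player's overview page. As a minimal sketch, assuming every href ends in an `overview` path segment (as the ones shown in this notebook do), they could be turned into absolute player-stats URLs for all players rather than just the first. The helper name `stats_url` and the two sample hrefs are illustrative, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.atptour.com"


def stats_url(overview_href, base_url=BASE_URL):
    """Turn a relative '.../overview' href into an absolute player-stats URL.

    Hypothetical helper: assumes the href ends in an 'overview' segment.
    """
    # Swap only the final path segment, so a player slug that happens to
    # contain the word "overview" is left untouched.
    head, _, tail = overview_href.rpartition("/")
    if tail != "overview":
        raise ValueError(f"unexpected href: {overview_href!r}")
    return urljoin(base_url, f"{head}/player-stats")


# Sample hrefs copied from the rankings table in this notebook.
sample_hrefs = [
    "/en/players/novak-djokovic/d643/overview",
    "/en/players/rafael-nadal/n409/overview",
]
stats_urls = [stats_url(href) for href in sample_hrefs]
print(stats_urls[0])
```

Replacing only the last segment is a slightly stricter alternative to calling `str.replace` on the whole href, which would also rewrite any earlier occurrence of "overview" in the path.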

### Per player data scrape:

In [32]:
"""
Scrape per-player stats.
"""

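# Load the first player's stats page in Selenium
uri = "https://www.atptour.com" + player_urls[0].replace("overview", "player-stats")
driver.get(uri)

# Pause before reading the page source so the page has a moment to render
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'html.parser')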

# Scrape left-hand table, skipping the header row
table = soup.find('table', class_="mega-table")
stats = {}
for row in table.find_all('tr')[1:]:
    items = row.find_all('td')
    stats[items[0].text.strip()] = items[1].text.strip()

# Scrape right-hand table (the next sibling element), skipping the header row
table = table.find_next_sibling()
for row in table.find_all('tr')[1:]:
    items = row.find_all('td')
    stats[items[0].text.strip()] = items[1].text.strip()

print(stats)

{'Aces': '5,813', 'Double Faults': '2,438', '1st Serve': '65%', '1st Serve Points Won': '74%', '2nd Serve Points Won': '56%', 'Break Points Faced': '5,480', 'Break Points Saved': '65%', 'Service Games Played': '13,338', 'Service Games Won': '86%', 'Total Service Points Won': '67%', '1st Serve Return Points Won': '34%', '2nd Serve Return Points Won': '55%', 'Break Points Opportunities': '9,376', 'Break Points Converted': '44%', 'Return Games Played': '12,964', 'Return Games Won': '32%', 'Return Points Won': '42%', 'Total Points Won': '54%'}
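
Every value in the scraped dictionary is still a string, with thousands separators or a percent sign, so it cannot go into a regression as-is. The following is a rough sketch of one way to convert them, assuming values only ever look like `'5,813'` or `'65%'` as in the output above. The helper name `parse_stat` is hypothetical, and the three sample entries are copied from that output.

```python
def parse_stat(raw):
    """Convert a scraped stat string to a number.

    Hypothetical helper: assumes the value is either a count with optional
    thousands separators ('5,813') or a percentage ('65%').
    """
    text = raw.strip().replace(",", "")
    if text.endswith("%"):
        # Store percentages as fractions between 0 and 1.
        return float(text[:-1]) / 100
    return int(text)


# A few entries copied from the scraped output above.
raw_stats = {"Aces": "5,813", "1st Serve": "65%", "Total Points Won": "54%"}
clean_stats = {name: parse_stat(value) for name, value in raw_stats.items()}
print(clean_stats)
```

Whether percentages are better kept as fractions or as whole numbers is a modeling choice; fractions are used here only to keep the units explicit.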


In [20]:
tennis_dict = {'Rank': ranks, 'Name': players, 'Country': countries, 'Age': ages, 'Points': points, 'Tournaments': tourns, 'URI': player_urls}
tennis_df = pd.DataFrame(tennis_dict)
tennis_df

| | Rank | Name | Country | Age | Points | Tournaments | URI |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Novak Djokovic | SRB | 33 | 11260 | 18 | /en/players/novak-djokovic/d643/overview |
| 1 | 2 | Rafael Nadal | ESP | 34 | 9850 | 18 | /en/players/rafael-nadal/n409/overview |
| 2 | 3 | Dominic Thiem | AUT | 27 | 9125 | 21 | /en/players/dominic-thiem/tb69/overview |
| 3 | 4 | Roger Federer | SUI | 39 | 6630 | 16 | /en/players/roger-federer/f324/overview |
| 4 | 5 | Daniil Medvedev | RUS | 24 | 5890 | 24 | /en/players/daniil-medvedev/mm58/overview |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Marcos Giron | USA | 27 | 684 | 28 | /en/players/marcos-giron/gc88/overview |
| 96 | 97 | Yannick Hanfmann | GER | 28 | 682 | 26 | /en/players/yannick-hanfmann/h997/overview |
| 97 | 98 | Andreas Seppi | ITA | 36 | 679 | 28 | /en/players/andreas-seppi/sa93/overview |
| 98 | 99 | Federico Coria | ARG | 28 | 671 | 35 | /en/players/federico-coria/ce77/overview |


### Data:

 * **acquisition**: web scraping
 * **storage**: flat files
 * **sources**: (as listed below or any other publicly available information)   
  - movie: boxofficemojo.com, imdb.com   
  - sports: sports-reference.com
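
For the flat-file storage requirement, one possible approach is to write the scraped rows to CSV. The sketch below uses only the standard library `csv` module and an in-memory buffer so it runs anywhere. The function name `write_csv` is hypothetical, and in a notebook that already holds a DataFrame, `DataFrame.to_csv` would do the same job in one call.

```python
import csv
import io

# Two rows copied from the rankings table in this notebook.
rows = [
    {"Rank": 1, "Name": "Novak Djokovic", "Country": "SRB", "Age": 33},
    {"Rank": 2, "Name": "Rafael Nadal", "Country": "ESP", "Age": 34},
]


def write_csv(rows, handle):
    """Write a list of dicts to an open text handle as CSV (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# The buffer stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```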
  

### Skills:

 * basics of the web (requests, HTML, CSS, JavaScript)
 * web scraping
 * `numpy` and `pandas`
 * `statsmodels`, `scikit-learn`


### Analysis:

 * linear regression is required, other regression methods are optional
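
To make the regression requirement concrete, here is a small sketch of a one-feature ordinary least squares fit written in plain Python, so the arithmetic is visible. The project itself is expected to use `statsmodels` or `scikit-learn`; this closed-form version is a stand-in for illustration. The five data points are the top five rows of the rankings table in this notebook, far too few for a meaningful model, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Rank and ranking points for the top five players in the table above.
ranks = [1, 2, 3, 4, 5]
points = [11260, 9850, 9125, 6630, 5890]

slope, intercept = fit_line(ranks, points)
print(f"points ~ {intercept:.0f} + ({slope:.0f}) * rank")
```

The negative slope simply reflects that lower-ranked players hold fewer points; a real model would use the per-player stats as features and a held-out set to check the fit.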


### Deliverable/communication:

 * organized project repository
 * slide presentation
 * visual and oral communication in presentations
 * write-up of process and results


### Design:

 * iterative design process
 * "MVP"s and building outward
 * [stand-ups/scrums](https://en.wikipedia.org/wiki/Scrum_(software_development)) (1 minute progress updates to the class)


## More information:

We'll learn about web scraping using two popular tools: BeautifulSoup and Selenium. You will need to know the basics of HTML. We will also evolve the way we use Jupyter notebooks: during this project, we begin to use the notebook as a development scratchpad where we test things out through interactive scripting, and then solidify our work in Python modules with reusable functions and classes.

We'll practice using linear regression. We'll have a first taste of feature selection, this time based on our intuition and some trial and error, and we'll build and refine our models.

This project will give you the freedom to challenge yourself, no matter your skill level. Find your boundaries and push them a little further. We are very excited to see what you will learn and do for Project Luther!