# Challenge Set 9
## Part III: Soccer Data

*Introductory - Intermediate level SQL*

---

Please complete this exercise using sqlite3 and Jupyter notebook.

Download the [SQLite database](https://www.kaggle.com/hugomathien/soccer/downloads/soccer.zip) and load it into your notebook using the sqlite3 library.

1. Which team scored the most points when playing at home?  

2. Did this team also score the most points when playing away?  

3. How many matches resulted in a tie?  

4. How many players have Smith for their last name? How many have 'smith' anywhere in their name?

5. What was the median tie score? Use the value determined in the previous question for the number of tie games. *Hint:* PostgreSQL does not have a median function. Instead, think about the steps required to calculate a median and use the [`WITH`](https://www.postgresql.org/docs/8.4/static/queries-with.html) command to store stepwise results as a table and then operate on these results. 

6. What percentage of players prefer their left or right foot? *Hint:* Calculate either the right or left foot, whichever is easier based on how you set up the problem.

In [58]:
import sqlite3
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

In [21]:
# Open connection
conn = sqlite3.connect('/home/cneiderer/Metis/Neiderer_Metis/Challenges/challenges_data/soccer.sqlite')
cur = conn.cursor()

In [40]:
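# Get table names (column 1 of each sqlite_master row is the object name)
# SQL string literals take single quotes; double quotes denote identifiers
cur.execute("SELECT * FROM sqlite_master WHERE type='table' ORDER BY name")

tables = [t[1] for t in cur.fetchall()]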

In [46]:
# Query data from each table into dictionary of DFs
df = {}
for t in tables:
    df[t] = pd.read_sql_query('SELECT * FROM ' + t, conn)

In [47]:
df.keys()

dict_keys(['Country', 'League', 'Match', 'Player', 'Player_Attributes', 'Team', 'Team_Attributes', 'sqlite_sequence'])

In [50]:
# Inspect DF properties (info() prints its report and returns None, hence the 'None' lines in the output)
for key in df.keys():
    print(key)
    print(df[key].info())
    print('\n')

Country
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
id      11 non-null int64
name    11 non-null object
dtypes: int64(1), object(1)
memory usage: 256.0+ bytes
None


League
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
id            11 non-null int64
country_id    11 non-null int64
name          11 non-null object
dtypes: int64(2), object(1)
memory usage: 344.0+ bytes
None


Match
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Columns: 115 entries, id to BSA
dtypes: float64(96), int64(9), object(10)
memory usage: 22.8+ MB
None


Player
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11060 entries, 0 to 11059
Data columns (total 7 columns):
id                    11060 non-null int64
player_api_id         11060 non-null int64
player_name           11060 non-null object
player_fifa_api_id    11060 non-null int64
birthday              11060 non-n

### Which team scored the most points when playing at home?  

In [136]:
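# Sum home goals per team; idxmax returns the team id with the largest total
home_team_id = df['Match'][['home_team_api_id', 'home_team_goal']].groupby(
    'home_team_api_id', sort=False).sum().idxmax(axis=0)
# idxmax on a DataFrame returns a Series keyed by column name, so select by position
home_team_id.iloc[0]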

8633

In [138]:
# Look up the team name for that id
home_team_name = df['Team'][df['Team']['team_api_id'] == home_team_id.iloc[0]]['team_long_name'].reset_index(drop=True)
home_team_name[0]

'Real Madrid CF'
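
Since the challenge asks for SQL, the same aggregation could also be expressed as a query. The following is a minimal sketch, not the notebook's original method: the helper name `top_scoring_team` and the in-memory demo rows are hypothetical, while the table and column names are the ones used in the cells above.

```python
import sqlite3

def top_scoring_team(conn, side):
    """Return (team_long_name, total_goals) for the team with the most goals on `side`."""
    if side not in ("home", "away"):  # whitelist, because `side` is interpolated into the SQL
        raise ValueError("side must be 'home' or 'away'")
    query = f"""
        SELECT t.team_long_name, SUM(m.{side}_team_goal) AS goals
        FROM "Match" AS m
        JOIN Team AS t ON t.team_api_id = m.{side}_team_api_id
        GROUP BY t.team_api_id, t.team_long_name
        ORDER BY goals DESC
        LIMIT 1
    """
    return conn.execute(query).fetchone()

# Tiny in-memory stand-in for soccer.sqlite (made-up rows, for illustration only)
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT);
    CREATE TABLE "Match" (home_team_api_id INTEGER, away_team_api_id INTEGER,
                          home_team_goal INTEGER, away_team_goal INTEGER);
    INSERT INTO Team VALUES (1, 'Team A'), (2, 'Team B');
    INSERT INTO "Match" VALUES (1, 2, 3, 1), (2, 1, 0, 2), (1, 2, 1, 4);
""")
print(top_scoring_team(demo, "home"))  # ('Team A', 4)
print(top_scoring_team(demo, "away"))  # ('Team B', 5)
```

Passing the notebook's `conn` instead of `demo` would run the same query against the real database; calling it with `"away"` covers the next question as well.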

### Did this team also score the most points when playing away?  

In [135]:
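# Sum away goals per team; idxmax returns the team id with the largest total
away_team_id = df['Match'][['away_team_api_id', 'away_team_goal']].groupby(
    'away_team_api_id', sort=False).sum().idxmax(axis=0)
# idxmax on a DataFrame returns a Series keyed by column name, so select by position
away_team_id.iloc[0]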

8634

In [134]:
# Look up the team name for that id
away_team_name = df['Team'][df['Team']['team_api_id'] == away_team_id.iloc[0]]['team_long_name'].reset_index(drop=True)
away_team_name[0]

'FC Barcelona'

No. Real Madrid CF scored the most goals at home, while FC Barcelona scored the most goals away. Both answers are based on total goals scored.
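
If "points" in the question is read as league points rather than goals, the calculation changes. The sketch below assumes the common 3 points for a win, 1 for a draw, 0 for a loss convention, which the challenge text does not state; `home_points` and the sample rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def home_points(matches):
    """Total league points per home team, assuming 3 per win, 1 per draw, 0 per loss."""
    diff = matches["home_team_goal"] - matches["away_team_goal"]
    points = (diff > 0) * 3 + (diff == 0) * 1  # boolean masks become 0/1 multipliers
    return points.groupby(matches["home_team_api_id"]).sum()

# Made-up rows: team 1 wins once and draws once, team 2 draws once and loses once
matches = pd.DataFrame({
    "home_team_api_id": [1, 1, 2, 2],
    "home_team_goal":   [2, 1, 5, 0],
    "away_team_goal":   [0, 1, 5, 1],
})
pts = home_points(matches)
print(pts.idxmax(), pts.max())  # 1 4
```

Applied to `df['Match']`, `pts.idxmax()` would give the `team_api_id` to look up in `df['Team']`. An away version would flip the sign of `diff` and group by `away_team_api_id`.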

### How many matches resulted in a tie?  

In [144]:
# A match is a tie when both sides scored the same number of goals
tied = (df['Match']['home_team_goal'] == df['Match']['away_team_goal']).sum()
tied

6596

### How many players have Smith for their last name? How many have 'smith' anywhere in their name?

In [160]:
first_last = df['Player']['player_name'].str.split(pat=' ', n=1, expand=True)
first_last.columns = ['First', 'Last']

In [165]:
lastname_smith = sum(first_last['Last'].str.lower() == 'smith')
lastname_smith

15

In [242]:
first_contains_smith = sum(first_last['First'].str.lower().str.contains('smith') == True)
first_contains_smith 

0

In [241]:
last_contains_smith = sum(first_last['Last'].str.lower().str.contains('smith') == True)
last_contains_smith

18
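
Splitting on the first space treats everything after the first word as the last name, so a name such as "Paul Van Smith" would not count as a Smith. A possible alternative, sketched here on made-up names, is to take the last whitespace-separated token as the last name and to search the full name for the substring. Both choices are assumptions about how the question should be interpreted.

```python
import pandas as pd

# Made-up names for illustration; in the notebook this would be df['Player']['player_name']
names = pd.Series(["John Smith", "Alan Smithson", "Smithy Jones",
                   "Paul Van Smith", "Ann Lee", "Cher"])

# Last whitespace-separated token, lower-cased
last_token = names.str.split().str[-1].str.lower()
last_is_smith = int((last_token == "smith").sum())

# Case-insensitive substring match on the full name; na=False counts missing names as non-matches
any_smith = int(names.str.contains("smith", case=False, na=False).sum())

print(last_is_smith, any_smith)  # 2 4
```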

### What was the median tie score? Use the value determined in the previous question for the number of tie games. 
*Hint:* PostgreSQL does not have a median function. Instead, think about the steps required to calculate a median and use the [`WITH`](https://www.postgresql.org/docs/8.4/static/queries-with.html) command to store stepwise results as a table and then operate on these results. 

In [215]:
med_tie = df['Match'][(df['Match']['home_team_goal'] - df['Match']['away_team_goal']) == 0]['home_team_goal'].median()
med_tie

1.0
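
The hint suggests doing this step by step in SQL with `WITH`. A minimal sketch of that idea follows, assuming an SQLite build with common table expression support: count the ties, work out where the middle row (or two middle rows) sits in the sorted scores, then average those rows. The function name `median_tie_score` and the demo rows are hypothetical.

```python
import sqlite3

def median_tie_score(conn):
    """Median goals per side in tied matches, or None when there are no ties."""
    n = conn.execute(
        'SELECT COUNT(*) FROM "Match" WHERE home_team_goal = away_team_goal'
    ).fetchone()[0]
    if n == 0:
        return None
    limit = 2 - n % 2        # one middle row for odd n, two for even n
    offset = (n - 1) // 2    # rows to skip before the middle
    return conn.execute("""
        WITH ties AS (
            SELECT home_team_goal AS goals
            FROM "Match"
            WHERE home_team_goal = away_team_goal
        ),
        middle AS (
            SELECT goals FROM ties ORDER BY goals LIMIT ? OFFSET ?
        )
        SELECT AVG(goals) FROM middle
    """, (limit, offset)).fetchone()[0]

# Made-up matches: four ties (0-0, 1-1, 2-2, 3-3) and one non-tie
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE "Match" (home_team_goal INTEGER, away_team_goal INTEGER);
    INSERT INTO "Match" VALUES (0, 0), (1, 1), (2, 2), (3, 3), (2, 1);
""")
print(median_tie_score(demo))  # 1.5
```

The tie count computed in the first query plays the role of the "value determined in the previous question".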

### What percentage of players prefer their left or right foot? *Hint:* Calculate either the right or left foot, whichever is easier based on how you set up the problem.

In [220]:
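# Fraction (multiply by 100 for a percentage) of Player_Attributes rows with a right preferred foot
prefer_right = sum(df['Player_Attributes']['preferred_foot'] == 'right') / df['Player_Attributes'].shape[0]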
prefer_right

0.7523127765276283
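
The figure above is a share of rows in `Player_Attributes`. That table can hold several records per player, so a per-player percentage may differ. The sketch below counts each player once, assuming the table has a `player_api_id` column and that the last record per player is representative; `foot_percentages` and the sample rows are hypothetical.

```python
import pandas as pd

def foot_percentages(attrs):
    """Percentage of distinct players per preferred foot, using each player's last non-null record."""
    per_player = (attrs.dropna(subset=["preferred_foot"])
                       .drop_duplicates(subset="player_api_id", keep="last"))
    return per_player["preferred_foot"].value_counts(normalize=True) * 100

# Made-up records: player 1 appears three times, player 4 has one missing value
attrs = pd.DataFrame({
    "player_api_id":  [1, 1, 1, 2, 3, 4, 4],
    "preferred_foot": ["right", "right", "right", "left", "right", "right", None],
})
pct = foot_percentages(attrs)
print(pct["right"], pct["left"])  # 75.0 25.0
```

On these sample rows a row-based calculation would give 5/7 for the right foot, which shows how repeated records can skew the result.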