# Practice Lab: Analyzing NBA games - Validating your data

You work at a sports data consulting firm that provides insights to basketball teams, broadcasters, and sports media outlets. Your job is to analyze player and team performance to identify trends, optimize strategies, and create engaging content for basketball fans. 

To complete this task, you decide to analyze the NBA Boxscore Dataset, which includes three tables:

- **game_info**: contains information about each game between two teams, including things like the scores and the outcome.
- **team_stats**: contains detailed statistics for each team in each game, such as points scored, rebounds, assists, and more.
- **player_stats**: contains individual game stats for each player, including points, assists, rebounds, and other performance details.

**The database is extensive, so queries might take a bit longer to complete.**

## Data Schema
The following diagram shows the data schema. For simplicity, it shows only a subset of the columns in each table.

<div style="text-align: center">
    <img src="imgsL2/NBA-db-relation.png" width=400>
</div>

For more details on each table, see the [🔗dataset explorer](https://www.kaggle.com/datasets/lukedip/nba-boxscore-dataset).

## General instructions
- **Replace any instances of `None` with your own code**. All `None`s must be replaced.
- **Compare your results with the expected output** shown below the code.
- **Check the solution** using the expandable cell to verify your answer. If needed, you can copy the code and paste it into the cell.

Happy coding!

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%">
<strong>Important note</strong>: Code blocks with None will not run properly. If you run them before completing the exercise, you will likely get an error. 
</div>

## Table of Contents
- [Step 1: Import Modules](#import-modules)
- [Step 2: Connect to the Database](#connect-to-the-database)
- [Step 3: Data Validation](#data-validation)
    - [Team Names](#team-names)
    - [Minutes Played](#minutes-played)

<a id="import-modules"></a>

## Import Modules
Begin by importing the `sqlite3` and `pandas` modules.

In [None]:
import sqlite3
import pandas as pd

<a id="connect-to-the-database"></a>

## Connect to the Database

Next, you need to establish a connection to the SQLite database to run queries and retrieve the data.

In [None]:
# Connect to the SQLite database
connection = sqlite3.connect("../NBA-Boxscore-Database.db")

# Check the connection with a small query
query_first_line = """
SELECT team, MP AS 'minutes played'
FROM team_stats 
LIMIT 1
""" 
pd.read_sql_query(query_first_line, connection)

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

<img src="imgsL2/conn_check.png" width="150">
</details>
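Another quick sanity check after connecting is to list the tables the database actually contains. The sketch below is a minimal illustration that uses a toy in-memory database (an assumption, so that it runs without the real file); `sqlite_master` is SQLite's built-in catalog table, while `demo_conn` and the column layouts are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for the real file (illustration only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE game_info (game_id TEXT);
CREATE TABLE team_stats (game_id TEXT, team TEXT, MP INTEGER);
CREATE TABLE player_stats (game_id TEXT, player TEXT, MP TEXT);
""")

# sqlite_master is SQLite's catalog of tables, indexes, and views
query_tables = """
SELECT name AS table_name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name;
"""
df_tables = pd.read_sql_query(query_tables, demo_conn)
print(df_tables)

demo_conn.close()
```

Running the same `query_tables` against the lab's `connection` should list the three tables described in the introduction.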

<a id="data-validation"></a>

## Data Validation
Ensuring the accuracy and consistency of the data is crucial before conducting any meaningful analysis. Since your insights will influence team strategies and media narratives, it's essential to validate the integrity of the dataset.
<a id="team-names"></a>

### Team Names
Start by checking the team names in the `team_stats` table to make sure they are consistent. The NBA has 30 teams, so you should expect 30 distinct names.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%">


**▶▶▶ Directions**
1. Write a SQLite query to: 
    - Select the **unique** team names from the `team` column in the `team_stats` table.
    - Use the alias `team_names`.

2. Make sure no team is repeated.

</div>

In [None]:
### START CODE HERE ###

# write the SQL query
query_team_names = """
SELECT DISTINCT(None) AS None 
FROM None;
"""

### END CODE HERE ###

# execute the query
df_team_names = pd.read_sql_query(query_team_names, connection)

# show the results
df_team_names

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

<img src="imgsL2/team_names.png" width="70">
</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# write the SQL query
query_team_names = """
SELECT DISTINCT(team) AS team_names 
FROM team_stats;
"""

```
</ul>
</details>

The NBA currently has 30 teams, but your query returned 31. It turns out that CHA and CHO both refer to the same Charlotte franchise: the team rebranded in 2014, and CHO is the current abbreviation. Watch out for this kind of detail in your own projects because it can distort your results. For example, you might think data is missing when it is actually filed under a different name.
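One way to handle a duplicated label like this is to map the old abbreviation onto the current one inside the query. The following is a minimal sketch on hypothetical toy rows (not taken from the real database), assuming you want CHA folded into CHO; `demo_conn` and `query_normalized` are illustrative names.

```python
import sqlite3
import pandas as pd

# Toy data (hypothetical rows, not taken from the real database)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE team_stats (game_id TEXT, team TEXT);
INSERT INTO team_stats VALUES
    ('g1', 'CHA'), ('g1', 'BOS'),
    ('g2', 'CHO'), ('g2', 'BOS');
""")

# Map the old abbreviation onto the current one before de-duplicating
query_normalized = """
SELECT DISTINCT
    CASE WHEN team = 'CHA' THEN 'CHO' ELSE team END AS team_names
FROM team_stats
ORDER BY team_names;
"""
df_normalized = pd.read_sql_query(query_normalized, demo_conn)
print(df_normalized)

demo_conn.close()
```

On the real `team_stats` table, the same `CASE` expression should bring the number of distinct names down from 31 to 30.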

<a id="minutes-played"></a>

### Minutes Played
In this lab, the total minutes played by a team in a game (the sum of its individual players' minutes) should not exceed 290. Make sure all records are consistent with that limit.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%">

**▶▶▶ Directions**
1. Write a SQLite query to: 
    - Count the number of entries with more than 290 minutes played. Use the `MP` column from the `team_stats` table.
    - Use the alias `num_inconsistent_MP`.

2. Are there any entries with an inconsistent duration?
</div>

In [None]:
### START CODE HERE ###

# write the SQL query
query_minutes_played = """
SELECT COUNT(None) AS None
FROM None
WHERE None
"""

### END CODE HERE ###

# execute the query
df_minutes_played = pd.read_sql_query(query_minutes_played, connection)

# show the results
df_minutes_played

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>
<img src="imgsL2/mp.png" width="180">

</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# write the SQL query
query_minutes_played = """
SELECT COUNT(*) AS 'num_inconsistent_MP'
FROM team_stats
WHERE MP > 290
"""
```
</ul>
</details>

As you can see, there are 30 entries where the minutes played exceed 290. To investigate further, look at how these entries are distributed across games by grouping the data by `game_id`. You can reuse most of the previous query.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%">

**▶▶▶ Directions**
1. Write a SQLite query to: 
    - Count the number of entries with more than 290 minutes played. Use the `MP` column from the `team_stats` table.
    - Use the alias `num_inconsistent_MP`.
    - Group the data by `game_id`. You can also select `game_id` to see which game each count belongs to.

2. How many entries with an inconsistent duration are there for each game? Why could that be?
</div>

In [None]:
### START CODE HERE ###

# write the SQL query
query_minutes_played = """
SELECT COUNT(None) AS None
FROM None
WHERE None
GROUP BY None;
"""

### END CODE HERE ###

# execute the query
df_minutes_played = pd.read_sql_query(query_minutes_played, connection)

# show the results
df_minutes_played

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>
<img src="imgsL2/mp_group.png" width="180">

</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# write the SQL query
query_minutes_played = """
SELECT game_id, COUNT(*) AS 'num_inconsistent_MP'
FROM team_stats
WHERE MP > 290
GROUP BY game_id;
"""
```
</ul>
</details>

The result contains 15 games, each with 2 flagged entries, which accounts for the 30 entries you found earlier. Two entries per game most likely means one for each team that played. Feel free to investigate this issue further yourself.
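One hypothesis worth testing for these entries is overtime. A regulation NBA game lasts 48 minutes with five players per team on the court, which gives 240 team minutes, and each 5-minute overtime period adds 25 more, so totals above 290 would correspond to three or more overtimes. The sketch below assumes that `MP` in `team_stats` holds a team's total minutes (not confirmed here); `overtime_periods` is a hypothetical helper introduced for illustration.

```python
REGULATION_TEAM_MINUTES = 5 * 48  # five players on court for 48 minutes
OVERTIME_TEAM_MINUTES = 5 * 5     # five players for each 5-minute overtime


def overtime_periods(team_minutes):
    """Return the number of overtime periods implied by a team's total minutes,
    or None if the total does not match a whole number of overtimes."""
    extra = team_minutes - REGULATION_TEAM_MINUTES
    if extra < 0 or extra % OVERTIME_TEAM_MINUTES != 0:
        return None
    return extra // OVERTIME_TEAM_MINUTES


# Try a few totals: regulation, one to three overtimes, and a value that fits none
for minutes in (240, 265, 290, 315, 300):
    print(minutes, overtime_periods(minutes))
```

If the flagged `MP` values in the real table all map to a whole number of overtimes, they may be long games rather than data errors; values that return `None` would deserve a closer look.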

Finally, run the next cell to close the connection.

In [None]:
connection.close()
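In your own scripts, you may prefer to let Python close the connection for you. A minimal sketch, assuming a toy in-memory database with a hypothetical table layout: `contextlib.closing` from the standard library calls `close()` when the `with` block ends, even if a query raises an error.

```python
import sqlite3
from contextlib import closing

# closing() guarantees demo_conn.close() runs when the block exits
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE team_stats (team TEXT, MP INTEGER)")
    demo_conn.execute("INSERT INTO team_stats VALUES ('BOS', 240)")
    row_count = demo_conn.execute("SELECT COUNT(*) FROM team_stats").fetchone()[0]
    print(row_count)

# The connection is closed here; further queries raise sqlite3.ProgrammingError
```

Note that using a `sqlite3` connection directly in a `with` statement only manages transactions and does not close it, which is why `closing` is used here.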

Congratulations on making it to the end of this lab. You will keep working on this dataset in Lesson 2. Hope you enjoyed it!