# Olympics Basketball - Data Wrangling

In **WSOlympicsBasketball.ipynb** we generated a dataset with per-game player performance stats and stored it in a CSV file. Some of its columns were stored in an inconvenient format, as we will see below, so in this notebook we clean the dataset to make it easier to use later.

## 1. Load dataset

In [1]:
import pandas as pd

df = pd.read_csv("basketball_olympic_players_game_stats.csv")
df.head()

Unnamed: 0,country,vs,team_score,vs_score,date,#,Players,Min,Pts,FG,...,OREB,DREB,REB,AST,PF,TO,STL,BLK,+/-,EFF
0,Iran,Czech Republic,78,84,2507,3,Mohammadsina Vahedi,02:33,0,0/2,...,0,0,0,0,0,1,0,0,-3,-3
1,Iran,Czech Republic,78,84,2507,5,Pujan Jalalpoor,09:04,3,1/3,...,0,0,0,0,0,1,0,0,-6,0
2,Iran,Czech Republic,78,84,2507,7,Mohammad Hassanzadeh,0:0,0,0/0,...,0,0,0,0,0,0,0,0,0,0
3,Iran,Czech Republic,78,84,2507,8,Saeid Davarpanah,0:0,0,0/0,...,0,0,0,0,0,0,0,0,0,0
4,Iran,Czech Republic,78,84,2507,13,Mohammad Jamshidijafarabadi,28:36,16,7/11,...,0,1,1,7,1,7,1,0,-5,13


## 2. Explore datatypes

Let's see the datatype of each of the columns in this dataset:

In [2]:
df.dtypes

country       object
vs            object
team_score     int64
vs_score       int64
date           int64
#              int64
Players       object
Min           object
Pts            int64
FG            object
2Pts          object
3Pts          object
FT            object
OREB           int64
DREB           int64
REB            int64
AST            int64
PF             int64
TO             int64
STL            int64
BLK            int64
+/-            int64
EFF            int64
dtype: object

All of them are either strings ('object') or integers ('int64'). This is fine for most columns, but there are some exceptions:

- **date** is a column of integers with the format (d)dmm, i.e. one or two digits for the day followed by two digits for the month. There are two reasonable options:
    - Convert it to a date datatype.
    - Separate it into a **day** column and a **month** column. I am taking this approach.

- **Min** is a column of strings (format MM:SS). I am converting it into a number of seconds (**Sec**).

- **FG**, **2Pts**, **3Pts** and **FT** are strings (format integer/integer). I am splitting each of them into two integer columns, suffixed with S (scored shots) and A (attempted shots), e.g. **FGS** and **FGA**.

## 3. Transform ugly columns

### 3.1. FG, 2Pts, 3Pts and FT

The transformation is the same for all four columns: split each value at **/** into two columns.

In [3]:
for col in ("FG", "2Pts", "3Pts", "FT"):
    # str.split returns strings, so cast both halves to integers
    df[[col+"S", col+"A"]] = df[col].str.split("/", expand=True).astype(int)
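As a minimal sketch of what this split does, the toy frame below uses made-up values in the same scored/attempted format. The `toy` frame and the `FG_missed` column are hypothetical names used only for illustration. Since `str.split` yields strings, the sketch casts the two halves to integers before doing arithmetic on them.

```python
import pandas as pd

# Toy frame with made-up values in the "scored/attempted" format.
toy = pd.DataFrame({"FG": ["0/2", "1/3", "7/11"]})

# Split at "/" into two columns and cast the string halves to integers.
toy[["FGS", "FGA"]] = toy["FG"].str.split("/", expand=True).astype(int)

# With integer columns, arithmetic works directly (hypothetical derived column).
toy["FG_missed"] = toy["FGA"] - toy["FGS"]

print(toy)
```

If the real data contained missing or malformed entries, the integer cast would raise an error, so it may be worth checking for those first.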

### 3.2. date

For **date**, the last two digits form the **month** column and the remaining leading digits form the **day** column. Note that both new columns hold strings at this point:

In [4]:
df["day"] = [str(d)[:-2] for d in df.date.values]
df["month"] = [str(d)[-2:] for d in df.date.values]
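A possible alternative, sketched here on a toy frame, is to split the (d)dmm integer arithmetically so that **day** and **month** come out as integers. The same sketch also shows the other option mentioned earlier, building a real date column. The year is not part of the dataset, so `ASSUMED_YEAR` is purely an assumption, and `game_date` is a hypothetical column name.

```python
import pandas as pd

# Toy values in the (d)dmm format: 25 July and 1 August.
toy = pd.DataFrame({"date": [2507, 108]})

# Integer division and modulo separate the day digits from the month digits.
toy["day"] = toy["date"] // 100
toy["month"] = toy["date"] % 100

# ASSUMPTION: the year is not stored in the dataset, so it is supplied by hand.
ASSUMED_YEAR = 2021

# pd.to_datetime can assemble dates from a frame with year/month/day columns.
toy["game_date"] = pd.to_datetime(
    pd.DataFrame({"year": ASSUMED_YEAR, "month": toy["month"], "day": toy["day"]})
)

print(toy)
```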

### 3.3. Min

The **Min** column (format MM:SS) is converted to a total number of seconds (**Sec**), computed as 60 times the minutes plus the seconds.

In [5]:
df["Sec"] = [60*int(d.split(":")[0]) + int(d.split(":")[1]) for d in df.Min.values]
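The same conversion can also be written with vectorized string methods instead of a list comprehension. The sketch below runs on a toy frame with values taken from the preview above. `toy` and `parts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy values in MM:SS format, including the "0:0" used for players who did not play.
toy = pd.DataFrame({"Min": ["02:33", "0:0", "28:36"]})

# Split once into minutes (column 0) and seconds (column 1), then cast to int.
parts = toy["Min"].str.split(":", expand=True).astype(int)

# Total playing time in seconds.
toy["Sec"] = 60 * parts[0] + parts[1]

print(toy)
```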

## 4. Remove ugly columns

In [6]:
for col in ("FG", "2Pts", "3Pts", "FT", "date", "Min"):
    del df[col]

## 5. Store and annotate dataset

The dataset is stored in **curated_basketball_olympic_players_game_stats.csv**, and the description and datatype of its columns can be found within **curated_basketball_olympic_players_game_stats.json**.

In [7]:
df.to_csv("curated_basketball_olympic_players_game_stats.csv", index=False)
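The notebook does not show how the JSON annotation file is produced or what its layout is. As a rough sketch under that caveat, one way to generate such an annotation is to map each column to its datatype and a hand-written description. The `build_annotation` helper, the `dtype`/`description` keys and the example description are all hypothetical.

```python
import json

import pandas as pd


def build_annotation(frame, descriptions):
    """Map each column to its dtype name and an optional description.

    ASSUMPTION: this layout is illustrative; the real JSON file may differ.
    """
    return {
        col: {"dtype": str(dtype), "description": descriptions.get(col, "")}
        for col, dtype in frame.dtypes.items()
    }


# Toy frame standing in for the curated dataset.
toy = pd.DataFrame({"country": ["Iran"], "Sec": [153]})

annotation = build_annotation(toy, {"Sec": "Playing time in seconds"})

# Serialize to a JSON string; writing it to a file would use json.dump instead.
annotation_json = json.dumps(annotation, indent=2)
print(annotation_json)
```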