![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Data Cleaning

This notebook explores one of the important steps of data science: data cleaning. This process helps make sure the data we use is accurate and reliable. We will fix mistakes, handle missing information, possibly remove unusual outliers, and make the formatting consistent.

To start, we will use a dataset about NBA team colors from [teamcolorcodes.com](https://teamcolorcodes.com/nba-team-color-codes).

In [None]:
import pandas as pd
nba_colors = pd.read_csv('data/nba-colors.csv')
nba_colors

We can see that the "Color" columns (Color 1 to Color 5) include both the color name and the hexadecimal [RGB color code](https://en.wikipedia.org/wiki/Web_colors).

To make the data more useful, we will split each of those columns in two at the `#` sign: the color name goes into a new column and the color code stays in the original one.

In [None]:
for i in range(1, 6):
    # The text before ' #' is the color name
    color_names = nba_colors[f'Color {i}'].str.split(' #').str[0]
    # The text after '#' is the code; put the '#' back in front of it
    color_codes = '#' + nba_colors[f'Color {i}'].str.split('#').str[1]
    nba_colors[f'Color {i} Name'] = color_names
    nba_colors[f'Color {i}'] = color_codes
nba_colors
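
As an alternative to splitting twice, the name and the code can be pulled out in one step with a regular expression. The sketch below is only an illustration: the `sample` DataFrame and its team names are made up (they are not rows from the real dataset), and the pattern assumes every code is a `#` followed by exactly six hexadecimal digits.

```python
import pandas as pd

# Hypothetical rows in the same "Name #code" shape as the dataset
sample = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'Color 1': ['Red #FF0000', 'Royal Blue #0000FF'],
    'Color 2': ['White #FFFFFF', None],  # Team B has no second color
})

# Assumed pattern: a name, optional whitespace, then '#' and six hex digits
pattern = r'^(?P<name>.*?)\s*(?P<code>#[0-9A-Fa-f]{6})$'

for i in range(1, 3):
    # str.extract returns one column per named group; missing values stay NaN
    parts = sample[f'Color {i}'].str.extract(pattern)
    sample[f'Color {i} Name'] = parts['name']
    sample[f'Color {i}'] = parts['code']

print(sample)
```

One possible advantage of this approach is that entries that do not match the expected shape become NaN, which makes them easy to spot.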

Since not every team has the same number of colors, there are a lot of [NaN](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) ("not a number") values, which is how pandas marks missing data. We could replace those with any value we like, so let's fill them with empty strings.

In [None]:
nba_colors.fillna('')
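
Note that `fillna` returns a new DataFrame and leaves the original unchanged unless the result is assigned to a variable. The small sketch below shows this with a made-up DataFrame (the `df` and `filled` names and the values are hypothetical, not part of the dataset).

```python
import pandas as pd

# Hypothetical data with one missing color code
df = pd.DataFrame({
    'Color 1': ['#FF0000', '#0000FF'],
    'Color 2': ['#FFFFFF', None],
})

# fillna returns a filled copy; df itself still contains the missing value
filled = df.fillna('')

print(df['Color 2'].isna().sum())      # missing values left in the original
print(filled['Color 2'].isna().sum())  # missing values left in the copy
```

To keep the cleaned version for later cells, the result can be assigned back, for example `df = df.fillna('')`.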

The [next notebook](06-getting-more-data.ipynb) will introduce ways to get more basketball data to work with.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)