# Exercise: Transform

## Instructions

Import the `songs.csv` dataset into a Pandas DataFrame. Perform exploratory analysis on the dataset. Whenever you find an issue, write code to fix it. Apply your fixes systematically, step by step. Be sure to include comments that explain your process.

In [496]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

songs_df = pd.read_csv("../data/songs.csv")

In [497]:
# Remove the `Index` column.
# Rename columns.
# The `Month` column contains strings. Convert it to an integer. Use a map function.
# The `Length (Duration)` column contains strings. Convert it to an integer.
# The `Year` column is a float. Be careful with 4 digit numbers versus 2.
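
The steps above can be sketched on a small toy DataFrame. The values, the month abbreviations, the comma in the duration string, the new column names, and the two-digit-year cutoff are all assumptions made for illustration; check what `songs.csv` actually contains before adapting any of it.

```python
import pandas as pd

# Toy data: formats here are assumptions, not taken from songs.csv.
toy = pd.DataFrame({
    "Index": [1, 2, 3],
    "Month": ["Jan", "Feb", "Dec"],
    "Length (Duration)": ["201", "1,412", "185"],
    "Year": [1999.0, 7.0, 2015.0],
})

# Remove the `Index` column and rename the rest to short, lowercase names.
toy = toy.drop(columns=["Index"])
toy = toy.rename(
    columns={"Month": "month", "Length (Duration)": "length", "Year": "year"}
)

# Map month strings to integers (assumes three-letter English abbreviations).
month_map = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}
toy["month"] = toy["month"].map(month_map)

# Strip thousands separators before converting the duration to an integer.
toy["length"] = toy["length"].str.replace(",", "", regex=False).astype(int)


def expand_year(y, cutoff=30):
    """Hypothetical helper: expand 2-digit years, leave 4-digit and missing ones alone."""
    if pd.isna(y) or y >= 100:
        return y
    # Assumption: values up to `cutoff` mean 20xx, anything larger means 19xx.
    return y + (2000 if y <= cutoff else 1900)


toy["year"] = toy["year"].map(expand_year)
print(toy)
```

The year stays a float here so that missing values can survive until the imputation step.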

Find the number of missing values in each column and each row. Remove rows where at least 50% of the values are missing. Then remove columns where at least 50% of the values are missing.

Hint: For rows, use the `thresh` keyword, short for "threshold".

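See [DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna). `thresh` is the minimum number of non-missing values a row needs in order to be kept, so for a 50% cutoff it is about the number of columns divided by 2.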
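
As a rough sketch of the counting and dropping steps, the toy example below uses made-up column names and values. Because `thresh` counts non-missing values, it uses `n // 2 + 1` so that a row or column that is exactly 50% missing is also removed; that reading of "at least 50%" is an assumption you may want to adjust.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns a-d.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, 6.0, 7.0, 8.0],
    "c": ["x", np.nan, "z", "w", "v"],
    "d": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Count missing values per column and per row.
missing_per_column = toy.isna().sum()
missing_per_row = toy.isna().sum(axis=1)

# Keep rows with more than half of their values present.
row_thresh = toy.shape[1] // 2 + 1
rows_kept = toy.dropna(axis=0, thresh=row_thresh)

# Then keep columns with more than half of the remaining rows present.
col_thresh = rows_kept.shape[0] // 2 + 1
cleaned = rows_kept.dropna(axis=1, thresh=col_thresh)

print(missing_per_column, missing_per_row, cleaned, sep="\n\n")
```

Rows 0 and 1 are dropped first, which leaves column `d` with only one value out of three, so it is dropped in the second pass.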

In [498]:
# Drop rows that are at least 50% missing, then columns that are at least 50% missing.

Fill in (or "impute") missing values. For numeric columns, impute the mean. For string/categorical columns, impute the mode, which `SimpleImputer` calls `"most_frequent"`.

```py
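numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(include="object").columns  # assumes text columns have object dtype

imputer = SimpleImputer()  # strategy defaults to "mean"
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = imputer.fit_transform(df[categorical_cols])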
```

[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [499]:
from sklearn.impute import SimpleImputer

Calculate descriptive statistics for numeric columns. 

Look for columns with outliers, meaning values more than 3 standard deviations away from the mean. We don't want to remove those outliers. The number of outliers per column can range from 0 to over 50.

![Outliers 3σ](../assets/three_stddev.png)
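
One possible way to count these outliers is to compute a z-score for every numeric value and flag those whose absolute value exceeds 3. The sketch below uses hypothetical column names (`plays`, `steady`) and made-up numbers, and it only counts the outliers without removing them.

```python
import pandas as pd

# Toy data: `plays` has one extreme value, `steady` has none.
toy = pd.DataFrame({
    "plays": [10] * 19 + [100],
    "steady": list(range(20)),
    "title": [f"song {i}" for i in range(20)],
})

# Descriptive statistics for the numeric columns only.
numeric = toy.select_dtypes(include="number")
stats = numeric.describe()

# z-score: distance from the column mean in units of the (sample) standard deviation.
z_scores = (numeric - numeric.mean()) / numeric.std()

# Flag values more than 3 standard deviations away and count them per column.
is_outlier = z_scores.abs() > 3
outlier_counts = is_outlier.sum()

print(stats)
print(outlier_counts)
```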

In [500]:
# outliers 3 stddev away

Create a column that uses the `datetime64` data type. Merge year and month with the [pd.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function.

The year is a float. Convert it to an integer. After imputation, the month will also be a float. Convert it to an integer too.

Remove the year and month columns.
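
A minimal sketch of this step, assuming the columns have already been renamed to `year` and `month` and contain no missing values. `pd.to_datetime` can assemble dates from a DataFrame with `year`, `month`, and `day` columns, so the example adds a placeholder `day` of 1; rounding the imputed month is also an assumption, since a mean such as 6.4 is not a valid month.

```python
import pandas as pd

# Toy data: floats, as they would look after mean imputation.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "year": [1999.0, 2007.0],
    "month": [1.0, 11.6],
})

# Convert both columns to integers (rounding the month first).
toy["year"] = toy["year"].astype(int)
toy["month"] = toy["month"].round().astype(int)

# Assemble a datetime column; every date falls on the first of the month.
toy["date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))

# The separate year and month columns are no longer needed.
toy = toy.drop(columns=["year", "month"])
print(toy.dtypes)
```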

In [501]:
# pd.to_datetime

The Top Genre column has 149 unique entries. The Artist column has 729 unique entries. We should focus on broad genres.

In [502]:
def extract_broad_genre(genre):
    broad_genre = [
        "hip hop",
        "rock",
        "pop",
        "metal",
        "soul",
        "folk",
        "country",
        "indie",
        "dance",
        "funk",
        "reggae",
        "adult",
        "dutch",
        "wave",
        "british",
        "disco",
        "mellow",
    ]
    for bg in broad_genre:
        if bg in genre:
            return bg
    return "other"


# songs_df["broad_genre"] = songs_df["top_genre"].map(extract_broad_genre)
# songs_df["broad_genre"].value_counts()

In [503]:
# songs_df[songs_df["broad_genre"] == "other"]["top_genre"].value_counts()

Create dummy variables for a categorical feature. Drop one level of each feature to end up with k-1 dummies, not k.

Hint: Use the `drop_first` keyword. `drop_first` defaults to `False`. Set it to `True`.

[get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) returns a DataFrame, not a Series, so we need `pd.concat` to join it along columns (`axis=1`), not rows.
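
The sketch below shows the idea on a toy DataFrame with three genres, so k = 3 and `drop_first=True` leaves 2 dummy columns. The `genre` prefix and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy data with three distinct genres.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "broad_genre": ["rock", "pop", "metal", "rock"],
})

# drop_first=True drops the first level in sorted order ("metal"), leaving k-1 dummies.
dummies = pd.get_dummies(toy["broad_genre"], prefix="genre", drop_first=True)

# Join the dummy columns side by side with the rest of the data.
toy = pd.concat([toy.drop(columns=["broad_genre"]), dummies], axis=1)
print(toy)
```

A row with zeros in every dummy column belongs to the dropped level, here `metal`.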

In [504]:
# get_dummies

## Save Your Clean Dataset

```py
songs_df.to_csv("../data/songs_clean.csv", index=False)  # index=False avoids writing the row index as an extra column
```