# Data Preprocessing:
Data preprocessing is a crucial step in machine learning: the higher the quality of the data, the more the produced results can be relied on. Large real-world datasets are typically **incomplete, noisy, and inconsistent**. Data preprocessing improves data quality by filling in missing values, smoothing noise, and resolving inconsistencies.

* **Incomplete data** can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions.
* There are many possible reasons for **noisy data** (having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Errors in data transmission can also occur. Incorrect data may also result from inconsistencies in naming conventions or data codes used, or inconsistent formats for input fields, such as date.

Several data preprocessing techniques are available, including:
1. **Data Cleaning**
2. **Data Integration**
3. **Data Transformation**
4. **Data Reduction**

* **Data cleaning** fills in missing values, removes noise, resolves inconsistencies, and identifies and removes outliers in the data.
* **Data integration** merges data from multiple sources into a coherent data store, such as a data warehouse. 
* **Data transformations**, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. 
* **Data reduction** can reduce the data size by eliminating redundant features, or clustering, for instance. 
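
To make the point about normalization and distance measurements concrete, here is a minimal sketch on made-up data (the income and age values are hypothetical, not taken from the dataset). It applies z-score scaling with plain NumPy, which is the same computation `StandardScaler` performs with its default settings, and compares Euclidean distances before and after.

```python
import numpy as np

# Hypothetical toy data: column 0 is income (large scale), column 1 is age (small scale).
X = np.array([
    [30_000.0, 25.0],
    [31_000.0, 60.0],
    [60_000.0, 26.0],
])


def euclidean(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))


# Before scaling, the income column dominates every distance.
raw_d01 = euclidean(X[0], X[1])  # rows differ mostly in age
raw_d02 = euclidean(X[0], X[2])  # rows differ mostly in income

# Z-score scaling: subtract the column mean, divide by the column standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"scaled: d(0,1)={scaled_d01:.2f}  d(0,2)={scaled_d02:.2f}")
```

In the raw data a 35-year age gap barely registers next to a 30,000 income gap. After scaling, both differences contribute on a comparable scale.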

**Reference**: Jiawei Han and Micheline Kamber, *Data Mining: Concepts and Techniques*, Second Edition.

**PS:** This is my first kaggle notebook contribution. Hope you like it!!

# Import the required libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from operator import itemgetter
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (GradientBoostingRegressor, GradientBoostingClassifier)

### Load the dataset for training and testing

In [None]:
dataset = pd.read_parquet('../data/hit_song_prediction_ismir2020/processed/msd_bb_mbid_cleaned_matches_ab_unique.parquet')
dataset['weeks'] = dataset['weeks'].fillna(0)
dataset['peakPos'] = dataset['peakPos'].fillna(150)
dataset['hit'] = (dataset['peakPos'] < 100.5).astype(int)

# 1. Data Cleaning

Select used columns.

In [None]:
import hit_prediction_code.common as common

data = dataset[common.get_columns_matching_list(dataset.columns, common.all_no_year_list()).union(['uuid'])]

### 1.1 Find the percentage of missing values in each column of the dataset.

In [None]:
def find_missing_percent(data):
    """
    Returns dataframe containing the total missing values and percentage of total
    missing values of a column.
    """
    rows = []
    for col in data.columns:
        sum_miss_val = data[col].isnull().sum()
        percent_miss_val = round((sum_miss_val/data.shape[0])*100,2)
        # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once.
        rows.append({'ColumnName': col, 'TotalMissingVals': sum_miss_val, 'PercentMissing': percent_miss_val})
    return pd.DataFrame(rows, columns=['ColumnName', 'TotalMissingVals', 'PercentMissing'])

miss_df = find_missing_percent(data)

In [None]:
'''Displays columns with missing values'''
missing = 0.0

missing_df = miss_df[miss_df['PercentMissing']>missing].sort_values(by='PercentMissing', ascending=False)
display(missing_df)
print("\n")
print(f"Number of columns with more than {missing}% missing values:{str(missing_df.shape[0])}")
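
The same missing-value summary can also be computed without a Python loop. The sketch below is one possible vectorized variant, shown on a small hypothetical frame (`toy` and its columns are made up for illustration); it is not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# isnull().sum() counts missing values per column;
# isnull().mean() is the fraction of missing rows per column.
missing_summary = pd.DataFrame({
    "TotalMissingVals": toy.isnull().sum(),
    "PercentMissing": (toy.isnull().mean() * 100).round(2),
}).rename_axis("ColumnName").reset_index()

# Keep only columns that actually have missing values, worst first.
with_missing = (
    missing_summary[missing_summary["PercentMissing"] > 0.0]
    .sort_values(by="PercentMissing", ascending=False)
)
print(with_missing)
```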

In [None]:
'''Segregate the numeric and categoric data'''
numeric_cols = data.select_dtypes(['float', 'int']).columns
categoric_cols = data.select_dtypes('object').columns

# 2. Data Visualization

In [None]:
def plot_histogram(data, col1, col2, last_one=False):
    """
    Plot the histogram for the numerical columns.
    
    The number of bins is currently fixed at 40. Two common rules for
    choosing it from the data are noted here for reference.
    
    Freedman-Diaconis Rule:
    Bin width: (2 * IQR) / N^(1/3)
    N - Size of the data
    Number of bins: Range / bin width
    
    Disadvantage: The IQR might be zero for certain columns. In
    that case the bin width is zero and the number of bins is
    undefined (division by zero), so the actual range of the data
    is used as the bin width instead.
    
    Sturges Rule:
    Number of bins: ceil(log2(N)) + 1
    N - Size of the data
    Bin width: Range / number of bins
    
    """
    number_of_bins = 40
    
    freq1, bin_edges1 = np.histogram(data[col1], bins=number_of_bins)
    freq2, bin_edges2 = np.histogram(data[col2], bins=number_of_bins)
        
    if(last_one!=True):
        plt.figure(figsize=(45,18))  
        ax1 = plt.subplot(1,2,1)
        ax1.set_title(col1,fontsize=45)
        ax1.set_xlabel(col1,fontsize=40)
        ax1.set_ylabel('Frequency',fontsize=40)
        data[col1].hist(bins=bin_edges1,ax = ax1, xlabelsize=30, ylabelsize=30)   
    else:
        plt.figure(figsize=(20,10))
        ax1 = plt.subplot(1,2,1)
        ax1.set_title(col1,fontsize=25)
        ax1.set_xlabel(col1,fontsize=20)
        ax1.set_ylabel('Frequency',fontsize=20)
        data[col1].hist(bins=bin_edges1,ax = ax1, xlabelsize=15, ylabelsize=15)
    
    if(last_one != True):
        ax2 = plt.subplot(1,2,2)
        ax2.set_title(col2,fontsize=45)
        ax2.set_xlabel(col2,fontsize=40)
        ax2.set_ylabel('Frequency',fontsize=40)
        data[col2].hist(bins=bin_edges2, ax = ax2, xlabelsize=30, ylabelsize=30)
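
The two binning rules mentioned in the docstring above are not used by `plot_histogram`, which uses a fixed number of bins. As a rough sketch of how they could be computed, the helper names `freedman_diaconis_bins` and `sturges_bins` below are hypothetical, and the input is synthetic data. NumPy ships its own versions via `np.histogram_bin_edges(..., bins='fd')` and `bins='sturges'`.

```python
import numpy as np


def freedman_diaconis_bins(x):
    """Number of bins from the Freedman-Diaconis rule, with a zero-IQR fallback."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    data_range = x.max() - x.min()
    if iqr == 0 or data_range == 0:
        # Fallback: use the whole range as the bin width, i.e. a single bin.
        return 1
    bin_width = 2 * iqr / len(x) ** (1 / 3)
    return int(np.ceil(data_range / bin_width))


def sturges_bins(x):
    """Number of bins from Sturges' rule: ceil(log2(N)) + 1."""
    return int(np.ceil(np.log2(len(x)))) + 1


# Synthetic example data (assumption: 1000 standard-normal samples).
rng = np.random.default_rng(0)
values = rng.normal(size=1000)

fd_bins = freedman_diaconis_bins(values)
st_bins = sturges_bins(values)
print(f"Freedman-Diaconis: {fd_bins} bins, Sturges: {st_bins} bins")
```

Sturges' rule depends only on the sample size, whereas Freedman-Diaconis adapts to the spread of the data, so the two can give quite different bin counts for the same column.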


In [None]:
hist_cols = list(filter(lambda c: c.startswith('highlevel'), numeric_cols))
# hist_cols = data.columns

hist_cols = sorted(hist_cols)
for i in range(0,len(hist_cols),2):
    if(i == len(hist_cols)-1):
        plot_histogram(data, hist_cols[i], hist_cols[i], True)
    else:
        plot_histogram(data, hist_cols[i], hist_cols[i+1])
        

In [None]:
# strange_data = ((data['highlevel.gender.all.male'] > 0.377) & (data['highlevel.gender.all.male'] < 0.378)) | ((data['highlevel.mood_relaxed.all.relaxed'] > 0.808) & (data['highlevel.mood_relaxed.all.relaxed'] < 0.809))
strange_data = data['highlevel.mood_electronic.all.electronic'] > 0.979

strange_info = dataset.merge(data[strange_data], on=['uuid'])
display(strange_info['highlevel.mood_electronic.all.electronic_x'].describe())

display(strange_info['weeks'].plot.hist())
electronic = list(filter(lambda c: c.startswith('metadata.version'), strange_info.columns))
display(strange_info[['lastfm_listener_count', 'lastfm_playcount', 'weeks', 'peakPos', 'highlevel.mood_electronic.all.electronic_x'] + electronic])

In [None]:
clean_data = data[~strange_data]

final_data = dataset.merge(clean_data, on=['uuid'])[['weeks', 'peakPos']]

        
# display(final_data['hit'].plot.hist())

for i in range(0,len(hist_cols),2):
    if(i == len(hist_cols)-1):
        plot_histogram(clean_data, hist_cols[i], hist_cols[i], True)
    else:
        plot_histogram(clean_data, hist_cols[i], hist_cols[i+1])

# 3. Data Transformation
### 3.1 Skewed data:
![](https://miro.medium.com/max/1200/1*nj-Ch3AUFmkd0JUSOW_bTQ.jpeg)

* If skewness is less than -1 or greater than 1, the distribution is **highly skewed**.
* If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is **moderately skewed**.
* If skewness is between -0.5 and 0.5, the distribution is **approximately symmetric**.
* If skewness is exactly 0, the distribution shows no skew (every symmetric distribution has skewness 0).
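
The thresholds above can be turned into a small helper. This is only an illustrative sketch: `classify_skewness` is a hypothetical name, and the three sample columns are synthetic draws rather than columns from the dataset.

```python
import numpy as np
import pandas as pd


def classify_skewness(skew):
    """Map a skewness value to the categories listed above."""
    if abs(skew) > 1:
        return "highly skewed"
    if abs(skew) > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Synthetic columns with known theoretical skewness (assumption, for illustration):
# normal -> 0, gamma(shape=8) -> about 0.71, exponential -> 2.
rng = np.random.default_rng(42)
samples = pd.DataFrame({
    "normal": rng.normal(size=20_000),
    "gamma_shape8": rng.gamma(shape=8.0, size=20_000),
    "exponential": rng.exponential(size=20_000),
})

# DataFrame.skew() returns the sample skewness of each column.
skewness = samples.skew()
labels = skewness.apply(classify_skewness)
print(pd.DataFrame({"skewness": skewness.round(3), "label": labels}))
```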

#### 3.1.1 **Positively skewed data:**
* **Log transformation** (when the data is highly skewed)
    * log(X) - if no zero values are present
    * log(C + X) - if zero values are present
        * C is a constant added so that the smallest value will be equal to 1.
* **Square root transformation** (when the data is moderately skewed)
    * sqrt(X)
    
#### 3.1.2 **Negatively skewed data:**
* Reflect and Log transformation
    * log(K - X) - K is a constant from which the values are subtracted so that the smallest value is 1.
    * (K - X) makes the large number small and the small number large so the negatively skewed data becomes positively skewed.
* Reflect and Square root transformation
    * sqrt(K - X) 
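
As a quick check of the transformations listed above, the following sketch applies them to synthetic skewed data (the series `pos` and `neg` are made up for illustration) and compares skewness before and after. `C` and `K` are chosen as described above, so that the smallest value passed to the transformation is 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: an exponential sample is positively skewed;
# subtracting one from a constant gives a negatively skewed sample.
pos = pd.Series(rng.exponential(scale=2.0, size=5000))
neg = pd.Series(10.0 - rng.exponential(scale=2.0, size=5000))

# Positive skew: log(C + X), with C chosen so the smallest shifted value is 1.
C = 1 - pos.min()
pos_log = np.log(pos + C)

# Positive skew: sqrt(X) for non-negative data.
pos_sqrt = np.sqrt(pos)

# Negative skew: reflect with K - X, where K = max + 1 so the smallest reflected value is 1.
K = neg.max() + 1
neg_reflect_log = np.log(K - neg)
neg_reflect_sqrt = np.sqrt(K - neg)

summary = pd.Series({
    "pos (raw)": pos.skew(),
    "pos log(C + X)": pos_log.skew(),
    "pos sqrt(X)": pos_sqrt.skew(),
    "neg (raw)": neg.skew(),
    "neg log(K - X)": neg_reflect_log.skew(),
    "neg sqrt(K - X)": neg_reflect_sqrt.skew(),
})
print(summary.round(3))
```

Note that reflection flips the ordering of the values, so large original values map to small transformed values. That matters when the transformed column is interpreted later.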



In [None]:
def find_skewness(data, numeric_cols):
    """
    Calculate the skewness of the columns and segregate the positive
    and negative skewed data.
    """
    skew_dict = {}
    for col in numeric_cols:
        skew_dict[col] = data[col].skew()

    skew_dict = dict(sorted(skew_dict.items(),key=itemgetter(1)))
    positive_skew_dict = {k:v for (k,v) in skew_dict.items() if v>0}
    negative_skew_dict = {k:v for (k,v) in skew_dict.items() if v<0}
    return skew_dict, positive_skew_dict, negative_skew_dict

def add_constant(data, highly_pos_skewed):
    """
    Look for zeros in the columns. If zeros are present, log(0) would result in -infinity,
    so a constant is added to those columns before the log transformation.
    """
    C = 1
    for col in highly_pos_skewed.keys():
        if(col != 'SalePrice'):
            if(len(data[data[col] == 0]) > 0):
                data[col] = data[col] + C
    return data

def log_transform(data, highly_pos_skewed):
    """
    Log transformation of highly positively skewed columns.
    """
    for col in highly_pos_skewed.keys():
        if(col != 'SalePrice'):
            data[col] = np.log10(data[col])
    return data

def sqrt_transform(data, moderately_pos_skewed):
    """
    Square root transformation of moderately skewed columns.
    """
    for col in moderately_pos_skewed.keys():
        if(col != 'SalePrice'):
            data[col] = np.sqrt(data[col])
    return data

def reflect_sqrt_transform(data, moderately_neg_skewed):
    """
    Reflection and square root transformation of moderately negatively
    skewed columns.
    """
    for col in moderately_neg_skewed.keys():
        if(col != 'SalePrice'):
            K = max(data[col]) + 1
            data[col] = np.sqrt(K - data[col])
    return data

In [None]:
"""
If skewness is less than -1 or greater than 1, the distribution is highly skewed.
If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
"""
skew_dict, positive_skew_dict, negative_skew_dict = find_skewness(data, numeric_cols)
moderately_pos_skewed = {k:v for (k,v) in positive_skew_dict.items() if v>0.5 and v<=1}
highly_pos_skewed = {k:v for (k,v) in positive_skew_dict.items() if v>1}
moderately_neg_skewed = {k:v for (k,v) in negative_skew_dict.items() if v>=-1 and v<-0.5}
highly_neg_skewed = {k:v for (k,v) in negative_skew_dict.items() if v<-1}

display(highly_pos_skewed, moderately_pos_skewed, moderately_neg_skewed)

'''Transform data.'''
# data = add_constant(data, highly_pos_skewed)
# data = log_transform(data, highly_pos_skewed)
# data = sqrt_transform(data, moderately_pos_skewed)
# data = reflect_sqrt_transform(data, moderately_neg_skewed)