# Part 2: Data Preprocessing Cheatsheet

**Remove or fill any missing data.\
Remove unnecessary or repetitive features.\
Convert categorical string features to dummy variables.**

## Missing Data

**First, I explore the columns with missing data to decide which ones to keep, discard, or fill in.**

In [None]:
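import pandas as pd

# Assumes `df` is an already-loaded pandas DataFrame.
# First find the number of rows in the dataframe
len(df)

# Then count the missing values per column and express them as a percentage of the rows
missing_data = df.isna().sum()
missing_data_percentage = missing_data * 100 / len(df)

# Some columns have too many unique values to be worth converting to dummy variables.
# `column` is a placeholder for the name of the column being inspected.
df[column].unique().size   # How many different values the column has
df[column].value_counts()  # How often each value occurs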

# Simple function to check the correlation between data
def check_corr(dataset,x):
    print(f'Correlation with the {x} column:\n')
    print(dataset.corr(numeric_only=True)[x].sort_values().drop(x))
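
As an illustration of the exploration step above, here is a minimal sketch on a small made-up dataframe. The column names (`income`, `job`, `city`) and the `summary` table are assumptions for the example only, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, with gaps in two of the three columns.
toy = pd.DataFrame({
    "income": [1000.0, 2500.0, np.nan, 4000.0],
    "job": ["nurse", None, None, None],
    "city": ["Lisbon", "Porto", "Faro", "Braga"],
})

missing_count = toy.isna().sum()
missing_pct = missing_count * 100 / len(toy)

# Side-by-side summary, columns with the most missing data first.
summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .sort_values("percent", ascending=False)
)
print(summary)
```

A column such as `job`, with 75% of its values missing, is a likely candidate to discard, while `income`, with 25% missing, might be worth filling in.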

**How to fill missing values:**

Replace every NaN in a column with the value from the same row of another column.

*This is useful when the two columns are highly correlated.*

In [None]:
def replace_one_value(replaced_val, new_val):
    """Return new_val when replaced_val is missing, otherwise keep replaced_val."""
    if pd.isna(replaced_val):
        return new_val
    return replaced_val

# Calls the replacing function for every row with apply.
df['replaced_val'] = df.apply(lambda x: replace_one_value(x['replaced_val'], x['new_val']), axis=1)
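
The same row-by-row fill can probably be written without `apply`, because `Series.fillna` accepts another Series and aligns it on the index. A minimal sketch, assuming a made-up dataframe that reuses the placeholder column names `replaced_val` and `new_val`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two NaN values in the column to be filled.
toy = pd.DataFrame({
    "replaced_val": [1.0, np.nan, 3.0, np.nan],
    "new_val": [10.0, 20.0, 30.0, 40.0],
})

# fillna aligns on the index, so each NaN takes the value
# from the same row of the other column.
toy["replaced_val"] = toy["replaced_val"].fillna(toy["new_val"])
print(toy["replaced_val"].tolist())  # [1.0, 20.0, 3.0, 40.0]
```

On large dataframes this vectorized form is usually faster than calling a Python function once per row.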


Other useful functions

In [None]:
# Drops a column
df.drop('Column to drop',inplace=True,axis=1)

# Drops every row (axis=0) or every column (axis=1) that contains at least one NaN value.
# Usually only one of the two is needed: after dropping the rows, no NaN values remain.
df.dropna(inplace=True, axis=0)
df.dropna(inplace=True, axis=1)
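
Dropping every row or column that has any NaN can discard a lot of data. `dropna` also takes `subset` and `thresh` parameters for more selective drops. A minimal sketch on a made-up dataframe (the column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", "y", "z"],
})

# Drop a row only when column "a" is missing in it.
rows_kept = toy.dropna(subset=["a"])

# Keep only the columns that have at least 2 non-missing values.
cols_kept = toy.dropna(axis=1, thresh=2)

print(rows_kept.shape)             # (2, 3)
print(cols_kept.columns.tolist())  # ['a', 'c']
```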

## Categorical Variables and Dummy Variables

After dealing with missing values, the remaining string (non-numeric) columns usually hold categorical data and may need to be converted to numbers.

In [None]:
# Finding every non-numeric column
df.select_dtypes(['object']).columns

In [None]:
# Easy conversion when the string contains a number.
# In this example, the number is in the last 4 characters of the string.
df[string_col] = df[string_col].str[-4:].astype(int)

In [None]:
# Map every string value to its numeric value
new_dict = {}  # Fill in as {string value: numeric value}
def to_int(old_val, new_val_dict):
    return new_val_dict[old_val]

df['string_col'] = df.apply(lambda x: to_int(x['string_col'], new_dict), axis=1)
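
For ordered categories, the dictionary lookup above can also be done with `Series.map`. A minimal sketch, assuming a made-up `grade` column and a hypothetical `grade_to_int` mapping:

```python
import pandas as pd

# Hypothetical toy data with an ordered categorical column.
toy = pd.DataFrame({"grade": ["low", "high", "medium", "low"]})

# Hypothetical mapping from each string value to a number.
grade_to_int = {"low": 0, "medium": 1, "high": 2}

toy["grade"] = toy["grade"].map(grade_to_int)

# map() yields NaN for any value missing from the dictionary,
# so it is worth counting unmapped values afterwards.
unmapped = toy["grade"].isna().sum()
print(toy["grade"].tolist())  # [0, 2, 1, 0]
print(unmapped)               # 0
```

Unlike the direct dictionary lookup, which raises a `KeyError` on an unknown value, `map` silently produces NaN, hence the explicit check.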

In [None]:
# Converting categorical column to dummies and merging with the original dataframe
def str_to_dum(dataframe, y):
    """
    Replaces the string column(s) y with dummy (one-hot) columns.
    (EX: if y has 3 possible values, creates 2 new columns, each with
    value 1 if the given instance has that value; the first value is
    dropped by drop_first=True and is represented by all zeros)
    :dataframe: pandas.DataFrame
    :y: string or array of strings - which columns will be replaced
    :return: the transformed dataframe
    """
    dummies = pd.get_dummies(dataframe[y], dtype='int32', drop_first=True)
    merged = pd.concat([dataframe, dummies], axis='columns')
    merged.drop(y, axis='columns', inplace=True)

    return merged
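
To make the effect of `drop_first=True` concrete, here is a minimal sketch of the same dummy-encoding steps on a made-up dataframe. The `amount` and `home` columns and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical string column.
toy = pd.DataFrame({
    "amount": [100, 200, 300],
    "home": ["OWN", "RENT", "MORTGAGE"],
})

# Categories are sorted alphabetically, so drop_first removes "MORTGAGE".
dummies = pd.get_dummies(toy["home"], dtype="int32", drop_first=True)
encoded = pd.concat([toy.drop(columns="home"), dummies], axis="columns")

print(encoded.columns.tolist())  # ['amount', 'OWN', 'RENT']
print(encoded["OWN"].tolist())   # [1, 0, 0]
```

The dropped category is still recoverable: a row with 0 in both `OWN` and `RENT` is a `MORTGAGE` row. Dropping it avoids a redundant column that the other dummy columns already determine.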

[By José H](https://github.com/dev-joseh)