## Transformation

This notebook transforms the dataframe produced by the cleaning step and takes a first pass at the analysis.

#### Importing libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # the plt alias conventionally refers to the pyplot submodule
import pandas_profiling
import urllib.request
import datetime
import plotly.express as px
import plotly.graph_objects as go
import re

#### Load the dataframe from the .csv file

In [3]:
df = pd.read_csv('clean_columns_v7.csv')

### Define subsets for later analysis 

##### Frequency of occurrences for each variable using 'value_counts', plus the corresponding proportions (values between 0 and 1) via the 'normalize' parameter

In [None]:
activity_freq_norm = df['Activity'].value_counts(dropna=True, normalize=True)
activity_freq = df['Activity'].value_counts(dropna=True)


In [None]:
country_freq_norm = df['Country'].value_counts(dropna=True, normalize=True)
country_freq = df['Country'].value_counts(dropna=True)


In [None]:
sex_freq_norm = df['Sex'].value_counts(dropna=True, normalize=True)
sex_freq = df['Sex'].value_counts(dropna=True)


In [None]:
fatal_freq_norm = df['Fatal(y/n)'].value_counts(dropna=True, normalize=True)
fatal_freq = df['Fatal(y/n)'].value_counts(dropna=True)


In [None]:
hemisphere_freq_norm = df['Hemisphere'].value_counts(dropna=True, normalize=True)
hemisphere_freq = df['Hemisphere'].value_counts(dropna=True)


In [None]:
age_freq_norm = df['Age'].value_counts(dropna=True, normalize=True)
age_freq = df['Age'].value_counts(dropna=True)


In [None]:
year_freq_norm = df['Year'].value_counts(dropna=True, normalize=True)
year_freq = df['Year'].value_counts(dropna=True)


In [None]:
weekday_freq_norm = df['Weekday'].value_counts(dropna=True, normalize=True)
weekday_freq = df['Weekday'].value_counts(dropna=True)
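The eight cells above repeat the same two calls per column. As a minimal sketch, the pattern could be wrapped in a helper and run in a loop. The helper name `freq_table`, the dictionary `freq_tables`, and the toy dataframe are illustrative assumptions, not part of the original notebook. Because `dropna=True` is used, the proportions are relative to the non-missing values only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
toy = pd.DataFrame({
    'Sex': ['M', 'F', 'M', None],
    'Weekday': ['Mon', 'Tue', 'Mon', 'Mon'],
})

def freq_table(frame, column):
    """Return counts and proportions for one column, ignoring missing values."""
    counts = frame[column].value_counts(dropna=True)
    shares = frame[column].value_counts(dropna=True, normalize=True)
    # Both Series share the same index (the distinct values), so they align.
    return pd.DataFrame({'count': counts, 'share': shares})

# One table per column, keyed by column name.
freq_tables = {col: freq_table(toy, col) for col in toy.columns}
```

With the real data, the same loop could run over a list such as `['Activity', 'Country', 'Sex']`, and each result would be looked up as `freq_tables['Sex']`.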


##### Define the 'clean_activity' function

'clean_activity' is a user-defined function that cleans and standardizes the values of the 'Activity' column.
Each condition searches the lowercased value for a set of regex patterns and returns the matching category; the conditions are checked in order, so the first match wins.
Missing values are converted to the string 'nan' by 'str', match none of the patterns, and therefore end up as 'other'.
The dataframe is then updated by applying the function to the column.

In [None]:
def clean_activity(activity):
    activity = str(activity).lower()
    if re.search(r'surf|board', activity):
        return 'surf'
    elif re.search(r'divi|dive|snor', activity):
        return 'dive'
    elif re.search(r'wad|fish', activity):
        return 'fishing'
    elif re.search(r'resc|escap|swim|bath|float', activity):
        return 'swimming'
    elif re.search(r'tread|splash|jump|stand|play', activity):
        return 'standing'
    elif re.search(r'kaya|kaja|paddl', activity):
        return 'paddling'
    elif re.search(r'shark', activity):
        return 'shark activities'
    else:
        return 'other'

df['Activity'] = df['Activity'].apply(clean_activity)
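To make the rule order easier to inspect, the same logic can be expressed as an ordered table of pattern and label pairs. This is a sketch of an alternative layout, not the notebook's own code: the names `ACTIVITY_RULES` and `clean_activity_table` and the sample values are assumptions made for illustration.

```python
import re

# Ordered (pattern, label) pairs mirroring the if/elif chain above.
# Order matters: the first matching pattern wins.
ACTIVITY_RULES = [
    (r'surf|board', 'surf'),
    (r'divi|dive|snor', 'dive'),
    (r'wad|fish', 'fishing'),
    (r'resc|escap|swim|bath|float', 'swimming'),
    (r'tread|splash|jump|stand|play', 'standing'),
    (r'kaya|kaja|paddl', 'paddling'),
    (r'shark', 'shark activities'),
]

def clean_activity_table(activity):
    """Return the label of the first rule whose pattern matches, else 'other'."""
    text = str(activity).lower()
    for pattern, label in ACTIVITY_RULES:
        if re.search(pattern, text):
            return label
    return 'other'

# Hypothetical sample values, including a missing one.
samples = ['Surfing', 'Paddle boarding', 'Spearfishing', 'Standing', None]
cleaned = [clean_activity_table(s) for s in samples]
# cleaned == ['surf', 'surf', 'fishing', 'standing', 'other']
```

Note that 'Paddle boarding' is labeled 'surf' rather than 'paddling', because the 'board' pattern is checked before 'paddl'. Reordering the rules would change that outcome.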

Next, a new column called 'Sex_num' is added to the dataframe. It maps the 'M' and 'F' values of the original 'Sex' column to 1 and 2; any other value, including a missing one, becomes NaN.
The numeric column is kept in case later analysis needs numerical values.

In [None]:
df['Sex_num'] = df['Sex'].map({'M': 1, 'F': 2})
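The behavior of `map` with a dictionary can be checked on a small example. The toy Series below is hypothetical; the conversion to the nullable `Int64` dtype is an optional extra that the original notebook does not perform.

```python
import pandas as pd

# Hypothetical values: two valid codes, one unexpected code, one missing.
sex = pd.Series(['M', 'F', 'N', None])

# Values absent from the dictionary are mapped to NaN.
sex_num = sex.map({'M': 1, 'F': 2})

# Optional: keep integers while still allowing missing values.
sex_int = sex_num.astype('Int64')
```

Because of the NaN entries, `sex_num` holds floats (1.0 and 2.0) rather than integers, which is why the nullable integer conversion may be useful.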