## Problem Statement
<ol style="font-size: 20px">
    <li>How many injuries in this dataset involve a skateboard?</li>
    <li>Of those injuries, what percentage were male and what percentage were female?</li>
    <li>What was the average age of someone injured in an incident involving a skateboard?</li>
</ol>

## Approach
<p style="font-size: 20px"> The analysis has three steps as follows: </p>
<ol style="font-size: 20px">
    <li> Find all incidents involving skateboards. The narrative of each row is checked for words similar to "skateboard" using both regular expressions and the Levenshtein distance algorithm, because the two methods cover each other's gaps. Regular expressions cannot handle errors inside the word (for example, transposed letters), but Levenshtein distance can. Levenshtein distance cannot handle extra characters before or after the word, but regular expressions can. </li>
    <li> Calculate the percentages of males and females among injuries involving a skateboard. </li>
    <li> Calculate the mean age of people with injuries involving a skateboard. </li>
</ol>
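
As a rough illustration of why step 1 uses both checks, the sketch below runs a regex search and an edit-distance comparison on two made-up tokens. The `levenshtein` helper is an assumption: a plain dynamic-programming stand-in for `distance.levenshtein`, written out so the snippet needs no third-party package. The tokens are hypothetical examples, not rows from the dataset.

```python
import re


def levenshtein(a, b):
    """Edit distance by dynamic programming (stand-in for distance.levenshtein)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


target = "skateboard"
# Hypothetical tokens: one with transposed letters, one with trailing text
# that the whitespace/slash/period tokenizer would not split off.
for token in ["SKATEBAORD", "SKATEBOARD,FELL"]:
    regex_hit = re.search(target, token, re.IGNORECASE) is not None
    print(token, regex_hit, levenshtein(target, token.lower()))
# SKATEBAORD False 2
# SKATEBOARD,FELL True 5
```

The transposed spelling is missed by the regex but sits at distance 2, while the token with trailing text is matched by the regex but sits at distance 5, so each check catches a case the other misses.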

In [1]:
# Loading project configuration
%run 'load_files.ipynb'

In [2]:
import re
import distance

## Step 1: Finding skateboard and similar words in narrative

In [3]:
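# Spellcheck using the Levenshtein distance algorithm
def match_distance(words, string):
    # Split the narrative into tokens on whitespace, "/" and "."
    components = re.split(r"[\s/.]+", string)
    for component in components:
        for word in words:
            result = distance.levenshtein(word.lower(), component.lower())
            # Return the first distance within the cutoff of 4
            # (not necessarily the smallest one in the narrative)
            if result <= 4:
                return result
    # No token is within distance 4 of any target word
    return -1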

In [4]:
# Function to be used with apply method later
def apply_distance_check(x):
    patterns = ['skateboard', 'skateboarding', 'skateboarder']
    return match_distance(patterns, x)

In [5]:
# Spellcheck using Regex
def match_regex(pattern, string):
    searcher = re.compile(pattern, re.IGNORECASE)
    if searcher.search(string):
        return 1
    return -1

In [7]:
# Function to be used with apply method later
def apply_regex_check(x):
    pattern = 'skateboard'
    return match_regex(pattern, x)

In [8]:
# Creating a copy of main_df to modify in later steps
skateboard_df = main_df.copy()
# Applying distance function for spellcheck
skateboard_df["is_skateboard_distance"] = main_df["narrative"].apply(apply_distance_check)

<b>Frequency distribution at different Levenshtein distances (-1 means no word within distance 4)</b>

In [9]:
skateboard_df["is_skateboard_distance"].value_counts()

-1    64760
 0      271
 4      269
 3      163
 1       24
 2       12
Name: is_skateboard_distance, dtype: int64

In [10]:
# Applying regex function for spellcheck
skateboard_df["is_skateboard_regex"] = main_df["narrative"].apply(apply_regex_check)

<b>Frequency distribution for regex matches, 1 implies match and -1 no match</b>

In [11]:
skateboard_df["is_skateboard_regex"].value_counts()

-1    65033
 1      466
Name: is_skateboard_regex, dtype: int64

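<b> The next four cells help to find the critical distance. They list rows flagged at each distance from 4 down to 1 that the regex did not match. A distance of 2 is taken as the critical threshold, which also agrees with the common NLP practice of treating small edit distances as likely misspellings. </b>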

In [12]:
skateboard_df[(skateboard_df["is_skateboard_distance"] ==4) & (skateboard_df["is_skateboard_regex"] == -1)]

Unnamed: 0,CPSC Case #,trmt_date,psu,weight,stratum,age,sex,race,race_other,diag,diag_other,body_part,disposition,location,fmv,prod1,prod2,narrative,is_skateboard_distance,is_skateboard_regex
321,140160553,1/20/14,90,6.6704,C,15,Female,Asian,,64,,92,1,9,0,5031,,"15 YOF WAS SNOWBOARDING IN THE SKI AREA, & CAU...",4,-1
385,141240274,12/15/14,99,82.3076,S,46,Female,None listed,,71,KNEE PAIN,35,1,0,0,1114,,46YOF WAS MOVING CARBOARD BOX WITH HER LEG AND...,4,-1
495,140300468,1/25/14,54,41.0402,M,17,Male,White,,64,,34,1,9,0,5031,,"17YOM SNOWBOARDING, FELL; SNOWBOARD HIT KNEE/J...",4,-1
829,140213960,1/27/14,59,80.0213,S,22,Male,White,,52,,75,1,9,0,5031,,22 YO MALE FELL WHILE SNOWBOARDING HITTING HEA...,4,-1
832,140341835,3/9/14,59,80.0213,S,35,Female,White,,52,,75,1,9,0,5031,,35 YO FEMALE FELL WHILE SNOWBOARDING HITTING H...,4,-1
840,140244597,2/12/14,59,80.0213,S,24,Male,White,,59,,88,1,9,0,5031,,24 YO MALE FELL WHILE SNOWBOARDING HITTING LIP...,4,-1
853,140148124,1/18/14,59,80.0213,S,14,Male,White,,52,,75,1,9,0,5031,,"14 YO MALE FELL WHILE SNOWBOARDING , DX CONCUS...",4,-1
859,140341831,3/10/14,59,80.0213,S,18,Female,White,,64,,33,1,9,0,5031,,"18 YO FEMALE FELL WHILE SNOWBOARDING , DX SPRA...",4,-1
860,140312386,2/20/14,59,80.0213,S,12,Female,White,,57,,34,1,9,0,5031,,"12 YO FEMALE FELL WHILE SNOWBOARDING , DX FX W...",4,-1
896,140213958,1/26/14,59,80.0213,S,15,Female,White,,64,,35,1,9,0,5031,,"15 YO FEMALE FELL WHILE SNOWBOARDING , DX KNEE...",4,-1


In [13]:
skateboard_df[(skateboard_df["is_skateboard_distance"] ==3) & (skateboard_df["is_skateboard_regex"] == -1)]

Unnamed: 0,CPSC Case #,trmt_date,psu,weight,stratum,age,sex,race,race_other,diag,diag_other,body_part,disposition,location,fmv,prod1,prod2,narrative,is_skateboard_distance,is_skateboard_regex
3041,140142351,1/19/14,21,14.3089,V,32,Male,None listed,,59,,93,1,1,0,4076,,32YM ACC STUBBED TOE AGAINST THE BASEBOARD OF ...,3,-1
6341,140807200,7/21/14,101,99.704,M,19,Male,None listed,,55,,30,1,9,0,1264,,19YOM WAS WAKEBOARDING AND INJURED LEFT SHOULD...,3,-1
8607,140753193,7/24/14,84,87.296,S,10,Female,White,,52,,75,1,1,0,1884,,"10 YOF,PT HAS RECENT H\O OF CONCUSSION. HAS BE...",3,-1
21393,141046727,10/13/14,65,82.3076,S,36,Male,White,,51,,32,1,1,0,312,,36YOM WITH SECOND DEGREE BURNS TO RIGHT ELBOW ...,3,-1
22136,150115269,12/27/14,68,99.704,M,28,Male,White,,57,,82,1,9,0,5040,,28YOM FRACTURED HAND FELL WHILE RIDING HIS BIK...,3,-1
23297,140332847,3/12/14,95,14.3089,V,22,Male,Other / Mixed Race,HISPANIC,62,,75,1,1,0,1842,1893.0,CHI. 22 YOM MISSED LAST STEP HITTING HEAD ON B...,3,-1
24355,140713606,6/21/14,101,89.7336,M,20,Female,None listed,,71,PAIN,31,1,9,0,1264,,20YOF COMPLAINED OF BACK PAIN AFTER WAKEBOARDI...,3,-1
26150,140834344,8/13/14,85,82.3076,S,17,Male,White,,52,,75,1,9,0,1264,,17YOM FELL WHILE WAKEBOARDING POSSIBLE CONCUSS...,3,-1
28921,150135203,12/26/14,66,82.3076,S,12,Female,None listed,,53,,83,1,1,0,1884,,"TOE CONT.: 12YOF BROTHER STOLE GIFT FROM HER, ...",3,-1
31556,140741047,5/14/14,14,41.0402,M,45,Male,White,,64,,80,1,9,0,852,1264.0,45YOM R UPPER ARM CAUGHT IN ROPE WHILE WATERB...,3,-1


In [14]:
skateboard_df[(skateboard_df["is_skateboard_distance"] ==2) & (skateboard_df["is_skateboard_regex"] == -1)]

Unnamed: 0,CPSC Case #,trmt_date,psu,weight,stratum,age,sex,race,race_other,diag,diag_other,body_part,disposition,location,fmv,prod1,prod2,narrative,is_skateboard_distance,is_skateboard_regex
16066,140548417,5/23/14,78,81.576,M,17,Male,None listed,,53,,87,1,0,0,1333,,17 YOM NJURED AFTER FALLING OFF SKATEBAORD. DX...,2,-1
17888,140703075,6/24/14,67,14.3089,V,21,Male,None listed,,53,,80,1,0,0,1333,,DX RT UPPER EXT SKIN ABRASION 21YOM ROAD RASH ...,2,-1
20239,150234735,7/27/14,42,74.3851,L,16,Male,Other / Mixed Race,HISPANIC,57,,75,1,5,0,1333,,16YOM PAIN TO HEAD WHEN FALL TO GROUND WHILE S...,2,-1
42393,140333396,3/12/14,51,74.3851,L,50,Male,Other / Mixed Race,HISPANIC,57,,33,1,0,0,1333,676.0,"50 YO M,LAST NIGHT PLAYING W/ DAUGHTER,SHOWING...",2,-1
47131,140660659,6/21/14,17,14.3089,V,21,Male,None listed,,57,,76,5,4,0,1333,,21YOM FX MANDIBLE- FELL SKATEBAORD,2,-1
50558,141123371,11/8/14,21,15.6716,V,19,Female,None listed,,64,,34,1,0,0,1333,,19YF WRIST PAIN SINCE FOOSH FROM SKATEOBARD YT...,2,-1
52554,140553148,5/25/14,95,14.3089,V,29,Male,White,,64,,35,1,4,0,1333,,RT KNEE STRAIN.29YOM WAS SKATEBAORDING AND PUT...,2,-1
56596,140430392,4/9/14,58,14.3089,V,18,Male,None listed,,57,,30,1,0,0,1333,,AN 18 YOM FELL WHILE SAKTEBOARDING AND INJURED...,2,-1
56763,140755832,7/20/14,37,5.7174,C,5,Male,None listed,,59,,76,1,0,0,1333,,5 YO M WAS PLAYING W/ ANOTHER CHILD WHEN HE WA...,2,-1


In [15]:
skateboard_df[(skateboard_df["is_skateboard_distance"] ==1) & (skateboard_df["is_skateboard_regex"] == -1)]

Unnamed: 0,CPSC Case #,trmt_date,psu,weight,stratum,age,sex,race,race_other,diag,diag_other,body_part,disposition,location,fmv,prod1,prod2,narrative,is_skateboard_distance,is_skateboard_regex
8724,140701314,6/23/14,68,89.7336,M,9,Female,White,,64,,34,1,5,0,1333,,9YOF SPRAINED WRIST FELL OFF HER SKATEBOAD ONT...,1,-1
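
The pattern in the four tables above can be checked by hand. The sketch below computes edit distances for words that appear in those narratives, using a small recursive `edit_distance` helper that is an assumption of this example (a stand-in for `distance.levenshtein`, not the notebook's own code).

```python
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance via memoized recursion (illustrative stand-in)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j  # insert the remaining j characters of b
        if j == 0:
            return i  # delete the remaining i characters of a
        substitution = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + substitution)
    return d(len(a), len(b))


# (word seen in a narrative, target pattern it is compared against)
pairs = [
    ("snowboarding", "skateboarding"),
    ("wakeboarding", "skateboarding"),
    ("baseboard", "skateboard"),
    ("skatebaord", "skateboard"),
    ("skateboad", "skateboard"),
]
for word, target in pairs:
    print(word, edit_distance(word, target))
# snowboarding 4
# wakeboarding 3
# baseboard 3
# skatebaord 2
# skateboad 1
```

Unrelated words such as "snowboarding" and "wakeboarding" land at distances 3 and 4, while the genuine misspellings land at 1 and 2, which is consistent with a cutoff of 2.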


In [16]:
# Creating final flag for skateboard involved
def create_final_flag(x):
    if (x["is_skateboard_distance"] <=2 and x["is_skateboard_distance"] >= 0) or x["is_skateboard_regex"] == 1:
        return 1
    else:
        return 0

skateboard_df["is_skateboard_final"] = skateboard_df.apply(create_final_flag, axis=1)

## Step 1 Answer: Total number of skateboard-related injuries = 476

In [17]:
total_skateboard_injuries = skateboard_df["is_skateboard_final"].value_counts()[1]
print("Number of injuries associated with skateboard is", total_skateboard_injuries)

Number of injuries associated with skateboard is 476


## Step 2: Percentages of males and females with injuries involving skateboard

In [18]:
# Checking for missing values in sex column
print("Number of missing values", num_missing(main_df, 'sex'))
# Checking for labels of sex variable
skateboard_df["sex"].value_counts()

Number of missing values 0


Male      35503
Female    29996
Name: sex, dtype: int64

In [19]:
# Get female and male injuries related to skateboard
female_skateboard_injuries = skateboard_df[skateboard_df["is_skateboard_final"] == 1] \
                                .groupby(["sex"])["is_skateboard_final"].sum()["Female"]
male_skateboard_injuries = skateboard_df[skateboard_df["is_skateboard_final"] == 1] \
                                .groupby(["sex"])["is_skateboard_final"].sum()["Male"]

## Step 2 Answer: 
<p style="font-size: 20px">Male Percentage with such injuries = 82.35 %</p>
<p style="font-size: 20px">Female Percentage with such injuries = 17.65 %</p>

In [20]:
# Get %female and %male of the total injuries related to skateboard
male_percentage = (male_skateboard_injuries / total_skateboard_injuries) * 100
female_percentage = (female_skateboard_injuries / total_skateboard_injuries) * 100
print("Male percentage of total injuries related with skateboard", male_percentage, "%")
print("Female percentage of total injuries related with skateboard", female_percentage, "%")

Male percentage of total injuries related with skateboard 82.35294117647058 %
Female percentage of total injuries related with skateboard 17.647058823529413 %
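
As a quick sanity check on the arithmetic, the reported percentages can be converted back into counts. The counts below are inferred from the printed total and percentages; they are not printed anywhere in the notebook itself.

```python
# Values copied from the notebook output above
total_skateboard_injuries = 476
male_percentage = 82.35294117647058
female_percentage = 17.647058823529413

# Invert percentage = count / total * 100 to recover the counts
male_count = round(total_skateboard_injuries * male_percentage / 100)
female_count = round(total_skateboard_injuries * female_percentage / 100)

print(male_count, female_count)  # 392 84
print(male_count + female_count == total_skateboard_injuries)  # True
```

The two inferred counts add up to the total from Step 1, so the percentages are internally consistent.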


## Step 3: Average age of someone with injury involving skateboard

In [22]:
# Checking for missing values in age column
print("Number of missing values", num_missing(main_df, 'age'))

Number of missing values 0


## Step 3 Answer: Mean Age = 18 years (approx.)

In [42]:
mean_age = skateboard_df[skateboard_df["is_skateboard_final"] == 1]["age"].mean()
print("Mean age of people with injuries related to skateboard", mean_age, "years.")

Mean age of people with injuries related to skateboard 18.044117647058822 years.
