# Cleaning Data 
It's time to learn how to clean up our data and handle missing values. In practice, almost every dataset has missing values or entries that need to be cleaned up or changed a bit. That's what we'll be doing in this notebook.

Let's start off by creating an example to work with.

In [1]:
import pandas as pd
import numpy as np

In [4]:
people = {
    'first': ['athoug', 'jane', 'john', 'chris', np.nan, None, 'NA'],
    'last': ['alsoughayer', 'doe', 'doe', 'schafer', np.nan, np.nan, 'Missing'],
    'email': ['athoug@mail.com', 'jane@mail.com', 'john@mail.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

In [5]:
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
3,chris,schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


## Dropping missing values
Let's start off by dropping missing values. As mentioned before, when working with data we will have missing data points; it's inevitable. Depending on what you want to do, there are a couple of different ways of tackling this.

One approach to handling missing data is to simply __remove it__. In our small dataset, let's say we want to do some analysis and a couple of entries don't have a `first`, `last` or `email`. In this case we'll just remove them.

To achieve that, we can use the `dropna` method.

In [6]:
df.dropna()

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
6,NA,Missing,NA,Missing


As we can see, it removed every row containing `NaN` or `None`, and the result shrank from 7 rows to 4. The change didn't persist in `df`, because `dropna` returns a new DataFrame unless we pass the `inplace=True` argument.
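To make the point about persistence concrete, here is a minimal sketch. The small `df_demo` frame and its values are made up for illustration; they are not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame.
df_demo = pd.DataFrame({
    'first': ['jane', 'john', None],
    'email': ['jane@mail.com', np.nan, 'anon@mail.com'],
})

rows_before = len(df_demo)
cleaned = df_demo.dropna()    # returns a new DataFrame
print(len(df_demo))           # the original still has all of its rows
print(len(cleaned))           # only the fully populated row remains

# To keep the result, assign it back (or to a new name).
df_demo = df_demo.dropna()
print(len(df_demo))
```

Assigning the result back is usually easier to follow than `inplace=True`, since each step produces a value you can inspect.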

__Note__ that we still have our custom missing values (row 6); we'll tackle those in a bit. Before that, let's explore `dropna` some more. We called it above without any arguments, so it used the defaults. We'll run it again with the defaults written out, just to get a better understanding of how it works.

In [9]:
df.dropna(axis='index', how='any')

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
6,NA,Missing,NA,Missing


We get the same result, as expected, but let's look at the arguments.
- `axis`, the first argument, can be set to `index` or `columns`. With `index`, pandas drops rows that have missing values; with `columns`, it drops columns that have missing values.
- `how`, the second argument, is the criterion for dropping a row or a column. By default it's set to `any`, which drops the row (or column) if any of its values is missing. The other option is `all`, which drops it only if all of its values are missing.

So with `index` and `any`, it drops every row that has at least one missing value.
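As a quick way to compare the four combinations of `axis` and `how`, the sketch below rebuilds the sample data and records the shape of each result. The names `df_demo` and `shapes` are just illustrative.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['athoug', 'jane', 'john', 'chris', np.nan, None, 'NA'],
    'last': ['alsoughayer', 'doe', 'doe', 'schafer', np.nan, np.nan, 'Missing'],
    'email': ['athoug@mail.com', 'jane@mail.com', 'john@mail.com', None, np.nan,
              'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df_demo = pd.DataFrame(people)

# Record (rows, columns) of the result for every axis/how combination.
shapes = {}
for axis in ('index', 'columns'):
    for how in ('any', 'all'):
        shapes[(axis, how)] = df_demo.dropna(axis=axis, how=how).shape
        print(axis, how, shapes[(axis, how)])
```

With this data, only `index`/`any` and `columns`/`any` remove a lot; no column is entirely missing, so `columns`/`all` leaves the frame untouched.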

However, this might not be what we want. Maybe in the analysis we're doing it's okay to have a missing email or something like that, but there has to be something; it can't be an entire row of missing values. If that's the case, we can change the `how` argument to `all`.

In [10]:
df.dropna(axis='index', how='all')

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
3,chris,schafer,,36
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


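We got more rows than before, because only row 4, where every value is missing, was dropped; rows that are only partly missing were kept.

Let's see an example of setting `axis` to `columns` and `how` to `any`. Every column has at least one missing value, so we should get an empty dataframe (just the index, with no columns), but let's put that to the test.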

In [11]:
df.dropna(axis='columns', how='any')

0
1
2
3
4
5
6


This is all well and good, but what if our analysis and data are more complex? Say we want to drop a row only if a specific column is missing. Let's take an example to illustrate the point.
Say we're doing an analysis where we only care about the `email`: if any other data point is missing that's fine, as long as the email is there.

To do that, we pass in a `subset` (a keyword argument) containing the column that we care about, like this:

In [12]:
df.dropna(axis='index', how='any', subset=['email'])

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


As we see, it kept only the rows that have an email and removed the rows that don't. (The last row is effectively all missing, but it uses custom placeholder values that we'll handle later.) Since there is only one column in the `subset`, `any` and `all` give the same result: the row is dropped exactly when that one column is missing.
So let's try an example where we care about both the `email` and the `last` name.

In [15]:
df.dropna(axis='index', how='any', subset=['email', 'last']) 

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
6,NA,Missing,NA,Missing


Now, with two columns in the subset, `how` works like a logical operator:
- `all` -> works like `or`: keep the row if the email or the last name exists
- `any` -> works like `and`: keep the row only if both the email and the last name exist; otherwise remove it
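The `and`/`or` reading above can be checked against explicit boolean masks. This is a small sketch that reuses only the `last` and `email` columns of the sample data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'last': ['alsoughayer', 'doe', 'doe', 'schafer', np.nan, np.nan, 'Missing'],
    'email': ['athoug@mail.com', 'jane@mail.com', 'john@mail.com', None, np.nan,
              'Anonymous@email.com', 'NA'],
})

has_email = df_demo['email'].notna()
has_last = df_demo['last'].notna()

# how='any' keeps rows where email AND last are both present.
any_drop = df_demo.dropna(subset=['email', 'last'], how='any')
both = df_demo[has_email & has_last]

# how='all' keeps rows where email OR last is present.
all_drop = df_demo.dropna(subset=['email', 'last'], how='all')
either = df_demo[has_email | has_last]

print(both.equals(any_drop))
print(either.equals(all_drop))
```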

### Removing custom values
Now, as we see from our dataframe above, that last row is filled with custom null values (`NA` and `Missing`).
How would we handle this? It depends on how we load our data. In this case we wrote the data ourselves, so what we can do is the following.

We will replace those values with a proper `numpy` `nan`, using the `replace` method.

In [16]:
df

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
3,chris,schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [18]:
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
3,chris,schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,
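As a side note, `replace` also accepts a list of values, so the two calls can be collapsed into one. A minimal sketch on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing the same custom placeholders.
df_demo = pd.DataFrame({
    'first': ['jane', 'NA', 'john'],
    'age': ['55', 'Missing', '63'],
})

# Replace every listed placeholder with NaN in a single call.
cleaned = df_demo.replace(['NA', 'Missing'], np.nan)
print(cleaned.isna().sum().sum())   # number of cells that are now missing
```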


Well, would you look at that: now we have proper empty values and no custom ones. Let's go back and drop the missing values like we did before.

In [19]:
df.dropna()

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63


Now let's say we don't want to drop anything and just want to check for missing values. Then we can run the `isna` method (or its alias `isnull`), which gives us a mask: a table of booleans that displays `False` for non-missing values and `True` for missing ones.

In [20]:
df.isna()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,True,True,True,True
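Because the mask is made of booleans, it can be summed to count missing values. The sketch below uses a small made-up frame to show counts per column, counts per row, and a check for rows that are missing everywhere.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
df_demo = pd.DataFrame({
    'first': ['jane', 'john', np.nan],
    'email': ['jane@mail.com', np.nan, np.nan],
})

mask = df_demo.isna()
missing_per_column = mask.sum()                 # True counts as 1
missing_per_row = mask.sum(axis='columns')
fully_missing_rows = mask.all(axis='columns')   # True where every value is missing

print(missing_per_column)
print(missing_per_row)
print(fully_missing_rows)
```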


Sometimes, especially when working with numerical values, we want to fill in the NA values with a specific value. To do that, we can use the `fillna` method. Let's take an example.

In [21]:
df.fillna('MISSING')

Unnamed: 0,first,last,email,age
0,athoug,alsoughayer,athoug@mail.com,33
1,jane,doe,jane@mail.com,55
2,john,doe,john@mail.com,63
3,chris,schafer,MISSING,36
4,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,Anonymous@email.com,MISSING
6,MISSING,MISSING,MISSING,MISSING
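Filling every column with the same string is not always appropriate, especially for numeric columns. `fillna` also accepts a dictionary that maps column names to fill values. Here is a minimal sketch on made-up data, where the choice of `0` for a missing age is only an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df_demo = pd.DataFrame({
    'first': ['jane', np.nan, 'john'],
    'age': [55.0, np.nan, 63.0],
})

# A different fill value per column; 'age' stays numeric.
filled = df_demo.fillna({'first': 'MISSING', 'age': 0})
print(filled)
print(filled.dtypes)
```

Keep in mind that filling numeric gaps with a constant changes statistics such as the mean, so it is worth deciding per analysis whether filling or dropping is the better fit.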


Let's move on to the next topic

## Casting data types
We have an `age` column in our sample data. Let's say we want to get the average age of the people in this sample, but when we look at our dataframe we can see that the ages are strings, not numbers.
To see the types of our data, we can use the `dtypes` attribute.

In [24]:
df.dtypes

first    object
last     object
email    object
age      object
dtype: object

__Note__ when it says `object`, it likely means the column holds strings or a mix of different things.
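One way to see what an `object` column really contains is to look at the Python type of each element. This is a small sketch on a hypothetical `ages` series that mixes strings, `None` and `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical object column with mixed contents.
ages = pd.Series(['33', '55', None, np.nan], dtype=object)

print(ages.dtype)                     # object
kinds = ages.map(type)                # Python type of every element
print(kinds.value_counts())
```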

So, as we can see, `age` holds string values, and if we try to calculate the average it won't work. Let's test it out regardless.

In [25]:
df['age'].mean()

TypeError: can only concatenate str (not "int") to str

This error message might be confusing, but it basically means that pandas can't do numeric operations on a string data type. So we need to cast the column to a numeric type. Before we do, we need to understand that when a column has empty `NaN` values, we have to cast it to `float`, because `NaN` is a float under the hood.

In [26]:
type(np.nan)

float

So when we try to convert to integers, we'll get an error when it reaches the missing values. The way to cast a column is to use the `astype` method and pass in the type we want.

Let's take a look 

In [27]:
df['age'] = df['age'].astype(int)

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

As expected, we get an error. If our data didn't have any missing values it would work just fine, but in our case we do have missing values, so we cast it to float instead.

In [28]:
df['age'] = df['age'].astype(float)

In [31]:
df.dtypes

first     object
last      object
email     object
age      float64
dtype: object

We see that our age is a float now, so let's go back and try to grab the average age of our dataset.

In [32]:
df['age'].mean()

46.75
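If whole-number ages are preferred despite the missing entries, one option is pandas' nullable integer dtype, written as the string `'Int64'` (capital `I`). This goes beyond what the notebook uses, so treat it as an optional sketch; the `ages` series is made up.

```python
import pandas as pd

# Hypothetical ages stored as strings, with one missing entry.
ages = pd.Series(['33', '55', None, '63'], dtype=object)

as_float = ages.astype(float)                # missing entry becomes NaN
as_nullable_int = as_float.astype('Int64')   # missing entry becomes <NA>

print(as_float.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.mean())                # missing values are skipped
```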

## Practice on our Stack Overflow data
As always, we leave the practice on this dataset to the last step, so let's start off by loading in the data.

Remember, the first thing we learned was how to remove missing data, and we already know that the Stack Overflow data is full of it. We saw how to `replace` custom missing values in our small dataset; now we'll learn the other way of handling them, which is at load time.

We can pass in a list of values that we want to treat as missing. Here's how to do it:
- create a list of missing values
- pass in the argument `na_values` and assign the list to it

In [33]:
# loading the data
na_val = ['NA', 'Missing']
df = pd.read_csv('./data/survey_results_public.csv', index_col='ResponseId', na_values=na_val)
schema_df = pd.read_csv('./data/survey_results_schema.csv', index_col='qname', na_values=na_val)
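The same idea can be tried without the survey files. The sketch below reads a tiny made-up CSV from memory; the column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV content with custom placeholders.
csv_text = "id,name,age\n1,jane,55\n2,Missing,NA\n3,john,63\n"

na_val = ['NA', 'Missing']
df_demo = pd.read_csv(io.StringIO(csv_text), index_col='id', na_values=na_val)

print(df_demo)
print(df_demo.isna().sum())   # one missing value in each column
```

Note that `read_csv` already treats `NA` as missing by default; the values in `na_values` are added on top of the defaults, so here it is `Missing` that really needs to be listed.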

In [34]:
# adjusting max column and row view
pd.set_option('display.max_columns', 79)
pd.set_option('display.max_rows', 79)

In [35]:
# check out first 5 rows
df.head()

ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,Country,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysProfessional use,OpSysPersonal use,VersionControlSystem,VCInteraction,VCHostingPersonal use,VCHostingProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,Blockchain,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
1,None of these,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,,,,,,Canada,CAD\tCanadian dollar,,,JavaScript;TypeScript,Rust;TypeScript,,,,,,,,,,,,,macOS,Windows Subsystem for Linux (WSL),Git,,,,,,,,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,Daily or almost daily,Not sure,,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Difficult,
3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14.0,5.0,Data scientist or machine learning specialist;...,20 to 99 employees,I have some influence,,United Kingdom of Great Britain and Northern I...,GBP\tPound sterling,32000.0,Yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,Angular.js,Angular;Angular.js,Pandas,.NET,,,Notepad++;Visual Studio,Notepad++;Visual Studio,Windows,Windows,Git,Code editor,,,,,Microsoft Teams,Microsoft Teams,Very unfavorable,Collectives on Stack Overflow;Stack Overflow;S...,Multiple times per day,Yes,Multiple times per day,Neutral,25-34 years old,Man,No,Bisexual,White,None of the above,"I have a mood or emotional disorder (e.g., dep...",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0
4,I am a developer by profession,"Employed, full-time",Fully remote,I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Books / Physical media;School (i.e., Universit...",,,20.0,17.0,"Developer, full-stack",100 to 499 employees,I have some influence,Other (please specify):,Israel,ILS\tIsraeli new shekel,60000.0,Monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,ASP.NET;ASP.NET Core,ASP.NET;ASP.NET Core,.NET,.NET,,,Notepad++;Visual Studio;Visual Studio Code,Notepad++;Visual Studio;Visual Studio Code,Windows,Windows,Git,Code editor;Command-line;Version control hosti...,,,Jira Work Management;Trello,Jira Work Management;Trello,Slack;Zoom,Slack;Zoom,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,A few times per week,"Yes, definitely",35-44 years old,Man,No,Straight / Heterosexual,White,None of the above,None of the above,No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,215232.0
5,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Stack Overflow;O...,,8.0,3.0,"Developer, front-end;Developer, full-stack;Dev...",20 to 99 employees,I have some influence,Start a free trial;Visit developer communities...,United States of America,USD\tUnited States dollar,,,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud Firestore;Elasticsearch;Microsoft SQL Se...,Cloud Firestore;Elasticsearch;Firebase Realtim...,Firebase;Microsoft Azure,Firebase;Microsoft Azure,Angular;ASP.NET;ASP.NET Core ;jQuery;Node.js,Angular;ASP.NET Core ;Blazor;Node.js,.NET,.NET;Apache Kafka,npm,Docker;Kubernetes,Notepad++;Visual Studio;Visual Studio Code;Xcode,Rider;Visual Studio;Visual Studio Code,Windows,macOS;Windows,Git;Other (please specify):,Code editor,,,,,Microsoft Teams;Zoom,,Unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Multiple times per day,Yes,Daily or almost daily,"Yes, definitely",25-34 years old,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Easy,


Let's start off by casting some values. Let's say that, for the developers who answered this survey, we want to calculate the average years of coding experience.

To answer this, we'll combine a few things we've learned throughout this series.

First, the column we need is `YearsCode`, so let's look at the first few answers in this column.

In [36]:
df['YearsCode'].head(10)

ResponseId
1     NaN
2     NaN
3      14
4      20
5       8
6      15
7       3
8       1
9       6
10     37
Name: YearsCode, dtype: object

As we can see, the type is `object`, which means it's likely a string, so we can't just use the `mean` method on it. What we should do is cast it to a `float` first.

In [37]:
df['YearsCode'] = df['YearsCode'].astype(float)

ValueError: could not convert string to float: 'More than 50 years'

Shoot, we still get an error, and that's because of the `'More than 50 years'` value. So some of the values are strings that aren't just numbers. Let's look at the unique values of this column to get an idea of what's in there.

To view the unique values in a series, we simply use the `unique` method. (We could also use the `value_counts` method we've seen before.)

In [38]:
df['YearsCode'].unique()

array([nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22',
       '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27',
       '24', '19', '9', '17', '18', '26', 'More than 50 years', '29',
       '30', '32', 'Less than 1 year', '48', '45', '38', '39', '28', '23',
       '43', '21', '41', '35', '50', '33', '31', '34', '46', '44', '42',
       '47', '49'], dtype=object)

So we see that we have `nan`, and that's fine, but we also have two other strings, `More than 50 years` and `Less than 1 year`. Those are the only two values that aren't numeric.
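Scanning the output of `unique` by eye works for a column this small. For longer columns, one possible approach is to let `pd.to_numeric` with `errors='coerce'` flag the entries that can't be parsed as numbers. The sketch below uses a made-up `years` series rather than the survey column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of YearsCode-like answers.
years = pd.Series(
    ['14', '20', np.nan, 'More than 50 years', 'Less than 1 year', '8'],
    dtype=object,
)

# Unparseable strings become NaN; keep those that were not missing to begin with.
as_numbers = pd.to_numeric(years, errors='coerce')
non_numeric = years[as_numbers.isna() & years.notna()].unique()
print(non_numeric)
```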

What we'll do is replace them with numbers so we can calculate the average.

Let's replace:
- `Less than 1 year` -> with 0 
- `More than 50 years` -> with 51

In [39]:
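# assign the result back; a chained inplace=True on df['YearsCode'] may not update df in newer pandas
df['YearsCode'] = df['YearsCode'].replace('Less than 1 year', 0)
df['YearsCode'] = df['YearsCode'].replace('More than 50 years', 51)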

In [40]:
df['YearsCode'].unique()

array([nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22',
       '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27',
       '24', '19', '9', '17', '18', '26', 51, '29', '30', '32', 0, '48',
       '45', '38', '39', '28', '23', '43', '21', '41', '35', '50', '33',
       '31', '34', '46', '44', '42', '47', '49'], dtype=object)

Now let's cast the values to floats and then calculate the mean on the column

In [41]:
df['YearsCode'] = df['YearsCode'].astype(float)

In [42]:
df['YearsCode'].mean()

12.251307285752338
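The whole sequence (replace the two text answers, cast to float, take the mean) can be tried end to end without the survey file. This sketch uses a short made-up series, so its result has nothing to do with the real survey average.

```python
import numpy as np
import pandas as pd

# Hypothetical YearsCode-like answers.
years = pd.Series(
    ['14', np.nan, 'Less than 1 year', 'More than 50 years', '8'],
    dtype=object,
)

# Map the two text answers to numbers, as in the steps above.
years = years.replace({'Less than 1 year': 0, 'More than 50 years': 51})

# Cast to float so that NaN is allowed, then average (NaN is skipped).
years = years.astype(float)
print(years.mean())
```

Keep in mind that 0 and 51 are stand-ins for open-ended answers, so the resulting mean is an approximation.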