In real-world applications it is very common to receive data with many unpopulated values. Most of the built-in `pandas` functions ignore missing values when computing statistics. In a pandas DataFrame, missing numerical values are represented by `NaN` (`np.nan`). These are sentinel values and can be detected with a function that we have already seen: 

````
pd.isnull()
````

For example, consider the following `pandas` Series: 
````
Series = pd.Series(['apple','banana','watermelon',np.nan,'grape'])
````
We can test whether a value in the Series is missing using the expression we saw before: 

# Data Cleaning

Now that we have a solid understanding of the fundamental structures of the Python language and its data science packages, we can explore applications in more realistic scenarios. Most of the time, the modeling work of a Data Scientist consists of preparing the data (importing, cleaning, combining and wrangling various datasets). This is commonly estimated to take roughly \\(80\%\\) of the modeling work time. In this section we will explore the tools to handle missing data, manipulate strings, transform the data, and some other methods.

<b>Note:</b> This section is intended to be more "hands-on" so that the user can become more familiar with these techniques.


In [2]:
import pandas as pd
import numpy as np
Series = pd.Series(['apple','banana','watermelon',np.nan,'grape'])
Series

In [3]:
print(Series)
print(' ')
Series.isnull()


Let's insert `None` into our previous `pandas` Series in the first position. We can then apply the method `isnull()` to check and see what happens.


In [5]:
Series[0] = None
Series.isnull()


### Example exercises

Let's put into practice some of the methods for data cleaning in Python.


Predict the output of the following code:
````
%python
import numpy as np
import pandas as pd
Series = pd.Series([1,2,3,np.nan,None,3.7,None,np.nan,100.3])
Series.isnull()
````
#### `PLEASE DON'T CODE`



In [8]:
import numpy as np
import pandas as pd
Series = pd.Series([1,2,3,np.nan,None,
                    3.7,None,np.nan,100.3])
Series.isnull()

# The ".isnull()" method can be used on both Series and DataFrames from the pandas library.
# It returns True where an element is null and False where it is not.
# Note - In Python, every True value equals 1, so applying ".sum()" to this result returns the number of null entries.

# NumPy is a great library for both random number generation and numerical computing.
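
The remark about `True` counting as 1 can be made concrete. The following is a minimal sketch; the variable names `values`, `mask`, `n_missing` and `share_missing` are only illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, np.nan, None, 3.7, None, np.nan, 100.3])

mask = values.isnull()        # boolean Series: True where the value is missing
n_missing = mask.sum()        # True counts as 1, False as 0
share_missing = mask.mean()   # fraction of missing entries

print(n_missing)              # 4
print(share_missing)
```

The same two calls work on a DataFrame, where they return one number per column.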


We explored here and in previous sections, methods to analyze and to <b>detect</b> missing values. However, we have not yet explored any method that filters out the missing values. Pandas provides a useful solution: 
````
## To return only the non-null data as well as the index of a Series, one can use: 
Series.dropna()

## This is equivalent to: 
Series[Series.notnull()]
````
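
As a quick check of the equivalence stated above, here is a small sketch on an illustrative Series named `s` (the name and the values are assumptions made for this example only).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None, 5.0])

dropped = s.dropna()          # new Series without the missing entries
masked = s[s.notnull()]       # same result through a boolean mask

print(dropped.equals(masked))   # True
print(list(dropped.index))      # [0, 2, 4]: the original index labels are kept
print(len(s))                   # 5: the original Series is unchanged
```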

Predict the output of the following code:

````
%python

Series = pd.Series([1,2,3,np.nan,None,303.5])
Series.dropna()
````
#### `PLEASE DON'T CODE`



In [11]:
#### `ANSWER:`
Series = pd.Series([1,2,3,np.nan,None,303.5])
Series.dropna()

# The "dropna()" method can be used on both Series and DataFrames. It drops all the NA elements and, by default, returns a new object without changing the original.

Predict the output of the following code:
````
Series = pd.Series([1,2,3,np.nan,None,303.5])
Series[Series > 100]
````
#### `PLEASE DON'T CODE`



In [13]:
#### `ANSWER:`
import pandas as pd
import numpy as np
Series = pd.Series([1,2,3,np.nan,None,303.5])
Series[Series > 100]

# Here, both "np.nan" and "None" are treated as null values. Any comparison with a null value evaluates to False, so a boolean mask built from a value range always leaves the null values out.
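
To see why a range filter never returns missing entries, it helps to look at the masks themselves. This is a small sketch; the names `s` and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, None, 303.5])

# Comparisons involving NaN evaluate to False in both directions
print((s > 100).tolist())   # [False, False, False, False, False, True]
print((s < 100).tolist())   # [True, True, True, False, False, False]

# Neither mask selects the missing entries, so they never show up in the result
kept = pd.concat([s[s > 100], s[s < 100]])
print(kept.isnull().any())  # False
```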

Predict the output of the following code:

````
%python

Series[Series.isnull()]
````
#### `PLEASE DON'T CODE`


In [15]:
#### `ANSWER:`

Series[Series.isnull()]
# By using the isnull method, we are creating a boolean mask. That way we can find all the NULL values within our data.

Predict the output of the following code:

````
%python

Series[Series.notnull()]
````
#### `PLEASE DON'T CODE`


In [17]:
#### `ANSWER:`

Series[Series.notnull()]
#Series[~Series.isnull()]

# The ".notnull()" method does the opposite of ".isnull()": the mask keeps only the rows that are not null.
# It can be used on both Series and DataFrame objects.
# Another way to get the same result is "Series[~Series.isnull()]".


In DataFrame objects we can specify whether we want to drop only the rows or columns where all the values are `NaN`, or any row or column containing a `NaN`. By default, `dropna()` drops the rows containing at least one missing value.

We can control this behavior with the parameters of `dropna()`, such as `thresh=3` (keep only rows that contain a minimum amount of information, in this case at least 3 non-null values) and `how='all'` (drop only rows where all the values are `NaN`). Let's see some examples below.

In [19]:
### Let's consider the following dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame([[3.,8.3,np.nan,8.7],[2.,np.nan,4.,4.5],[np.nan,8.9,4.,7.8],[np.nan,np.nan,10.2,10.2],[1.2,3.7,100.9,7.1]])
df 

What is the output if we apply `dropna()` to the previous dataframe? 


In [21]:
## Answer
df.dropna()

# In its default configuration ".dropna()" can be a little aggressive, because it returns only the rows without any null values.

What if we apply the following? 

````
df.dropna(how='all')
````


In [23]:
## Answer
df.dropna(how='all')

# The option how='all' makes ".dropna()" delete only the rows that are null in every column. No row of this DataFrame is entirely null, so nothing is dropped.


What if the values in columns 2 <b>and</b> 3 of row 3 were also missing, so that the entire row is `NaN`?
````
%python
df.dropna(how='all')
````
#### `PLEASE DON'T CODE`

In [25]:
#### `ANSWER:`
df_2 = pd.DataFrame([[3.,8.3,np.nan,8.7],[2.,np.nan,4.,4.5],[np.nan,8.9,4.,7.8],[np.nan,np.nan,np.nan,np.nan],[1.2,3.7,100.9,7.1]])
df_2.dropna(how='all')

# Since row 3 (the fourth line) is null in every column, it is deleted from the final output.


If instead, we want to drop columns with missing values we could do the following:
````
%python
df.dropna(axis=1)
````
What's the output of the previous code? 

#### `PLEASE DON'T CODE`

In [27]:
#### `ANSWER:`
df.dropna(axis=1)

# By default, the ".dropna()" method will look for rows with nulls
# But if we change the axis to 1 it will look for columns, that way it will return only the columns without null values. 

What's the output of the following script? 
````
%python
df.dropna(axis=1,how='all')
````
#### `PLEASE DON'T CODE`


In [29]:
#### `ANSWER:`
df.dropna(axis=1,how='all')

# It will drop the columns that have null values in all rows.


In [30]:
# Fraction of missing values in each column
df.isnull().sum()/df.shape[0]

Consider the following DataFrame


In [32]:
df = pd.DataFrame(np.random.randn(10,5))

df

In [33]:
df.iloc[:3,1:3] = np.nan
df.iloc[7:10,0:3] = np.nan
df

Which rows will remain after we apply `dropna()` to the previous dataframe? 
#### `PLEASE DON'T CODE`


#### `ANSWER:`

Rows \\(3, 4, 5\\) and \\(6\\)



We learned previously about the `thresh` parameter, which controls the minimum number of non-missing values a row must have in order to be kept. What's the output? 

````
%python

df.dropna(thresh=4)
````
#### `PLEASE DON'T CODE`


In [37]:
df.dropna(thresh=4)

#### `ANSWER:`

Rows \\(3, 4, 5\\) and \\(6\\).

``thresh`` is an optional parameter that specifies how many non-null values a row must contain in order to be kept. So with ``thresh = 3``, `dropna()` returns every row that has at least three valid (non-null) values.
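
Because the DataFrame above is random, a small deterministic sketch may make the rule easier to verify by hand. The frame and the names `frame` and `non_null_per_row` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})

# Number of valid values in each row: this is what thresh is compared against
non_null_per_row = frame.notnull().sum(axis=1)
print(non_null_per_row.tolist())                 # [3, 2, 0, 2]

print(frame.dropna(thresh=2).index.tolist())     # [0, 1, 3]
print(frame.dropna(thresh=3).index.tolist())     # [0]
```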



We saw in the previous section how to remove missing values in various ways. However, very often a lot of valuable information is lost if we simply remove rows or columns with a relatively small amount of missing data. It is better, instead, to fill in the "gaps" using a well-defined criterion. Pandas provides a built-in method: `fillna()`. It fills the gaps with a given value. For example, if we use the following, we fill in the gaps with 0.
````
df.fillna(0)
````


In [40]:
df

In [41]:
 
df.fillna(0)

# The ".fillna()" method does not change the original DataFrame; it returns a new, transformed DataFrame.
# To change your DataFrame in place you can use the option "inplace = True".

It is also possible to fill each column with a different value by passing a dictionary that maps column labels to fill values, such as: 
````
dictionary = {'Column1':Value1,...,'ColumnN':ValueN} 
df.fillna(dictionary)
````

Write a script to fill the missing values of column \\(0\\) with \\(1\\) and the remaining columns with \\(0\\).

In [44]:
 
df

In [45]:
df.fillna({   0 : 1
            , 1 : 0
            , 2 : 0
            , 3 : 0
            , 4 : 0})

df

# It's filling the first column with 1s where we have a null value
# and for all the other columns it is filling with 0s

 

What happens if we use `fillna` with `inplace`? What's an example of equivalent code to what's shown below? 
````
df.fillna(0,inplace=True)
````


#### `ANSWER:`
If we use `inplace`, the changes are made in the original dataframe. This is equivalent to `df = df.fillna(0)`


There are two other useful strategies available in `pandas` to fill gaps quickly: `ffill` and `bfill`. They can be selected through the `method` argument, such as: 
````
df.fillna(method='ffill',inplace=True)
df.fillna(method='bfill',inplace=True)
````
To explore a bit further the methods available to fill in missing values in a dataset, let's consider the following dataframe: 

In [49]:
df = pd.DataFrame(np.random.randn(8,4))
df.iloc[3:7,1:3] = np.nan
df

In [50]:
df.fillna(method='ffill',limit=3)


What happened to the previous dataframe when we used the methods '`ffill`' and '`limit`'?


#### `ANSWER:`
The missing values were replaced ('filled') with the previous non-missing value of each column ('ffill' stands for forward fill). In this case, columns 1 and 2 take their values from row 2, and those values fill the next 3 rows (rows 3 to 5). Row 6 stays `NaN` because of `limit=3`.

If we change the limit to 4, it will fill the next four rows.
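
The effect of `limit` can be verified on a small deterministic Series. The sketch below uses the method form `Series.ffill()`, since newer pandas releases emit a deprecation warning for `fillna(method=...)`; the name `s` is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])

limited = s.ffill(limit=3)    # at most 3 consecutive gaps are filled
unlimited = s.ffill()         # every gap after a valid value is filled

print(limited.tolist())       # [1.0, 1.0, 1.0, 1.0, nan, 6.0]
print(unlimited.tolist())     # [1.0, 1.0, 1.0, 1.0, 1.0, 6.0]
```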


Knowing that `bfill` does the opposite of `ffill`, predict the output of the following code:
````
%python

df.fillna(method='bfill',limit=3)
````
#### `PLEASE DON'T CODE`

In [54]:
#### `ANSWER`

df.fillna(method='bfill', limit=3)

# The "bfill" method does the exact opposite of "ffill": it takes the first valid value below a run of NULL values (here, row 7) and uses it to replace up to three NULL values above it (because limit = 3), so rows 4 to 6 are filled and row 3 stays NaN.
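
Two variations of backward filling are easy to confuse: `limit` caps how many consecutive gaps are filled, while `axis=1` changes the direction so that each gap takes the value from the column to its right. A minimal sketch, with illustrative names `s` and `frame`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])
back = s.bfill(limit=3)       # fills upwards from 6.0, at most 3 gaps
print(back.tolist())          # [1.0, nan, 6.0, 6.0, 6.0, 6.0]

frame = pd.DataFrame({"x": [np.nan, 1.0], "y": [2.0, np.nan], "z": [3.0, 4.0]})
across = frame.bfill(axis=1)  # each gap takes the next valid value in its row
print(across.values.tolist()) # [[2.0, 2.0, 3.0], [1.0, 4.0, 4.0]]
```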


It is also possible to fill in the missing values of a pandas dataframe with the result of any function we want. For example, one can fill in the missing values with the output of built-in methods, such as `mean()` or `max()`, or of any other mathematical expression applied to the data. 


Write the code to fill the `NaN` values of the previous dataframe with the average value of the following series:
````
Series = pd.Series([1.2,np.nan,None,3.5,10.8])
````

In [57]:
Series = pd.Series([1.2,np.nan,None,3.5,10.8])
df.fillna(Series.mean())

# Any scalar computed from the data can be passed to fillna(). Try other methods such as var(), std() or min().

# We can also build customized per-column fill values with the apply method
# df.fillna(df.apply(lambda x: x.max() - x.min()))



In [58]:
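def change_null_numeric(df, function = 'mean'):
    # Work on a copy so that the caller's DataFrame is left untouched
    df_aux = df.copy()
    # Only numeric columns can be filled with a mean
    cols = df_aux.select_dtypes(['float64','int64']).columns.values
    
    for col in cols:
        if function == 'mean':
            # Assign the result back instead of calling fillna(inplace=True) on a
            # column selection, which may act on a temporary copy
            df_aux[col] = df_aux[col].fillna(df_aux[col].mean())
            
    return df_aux
    
change_null_numeric(df)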

In [59]:
df.fillna(df.mean())
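
Passing `df.mean()` to `fillna()` works column by column because `mean()` returns a Series indexed by column label, and `fillna()` aligns on those labels. A small sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 10.0, 20.0]})

col_means = frame.mean()           # a: 2.0, b: 15.0 (NaN is skipped by default)
filled = frame.fillna(col_means)   # each column receives its own mean

print(filled["a"].tolist())        # [1.0, 2.0, 3.0]
print(filled["b"].tolist())        # [15.0, 10.0, 20.0]
```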

Everything we have covered so far in this section dealt with missing values. Removing repeated records is also a critical part of the data cleaning work of a Data Scientist. Here, we explore two related methods: `duplicated()` and `drop_duplicates()`. The `duplicated()` method checks whether one or more rows of a dataframe are duplicated (repeated). It returns a boolean Series, as shown below. We can then use `drop_duplicates()` to eliminate the identified duplicate rows. Let's start with the following dataset:

In [61]:
df = pd.DataFrame({'A' : ['FirstStr','SecondStr','ThirdStr',
                          'FourthStr','FifthStr','SixthStr','SeventhStr'],
                   'B': [1,2,3,4,5,6,6],'C' : [1,2,3,4,5,6,6],'D': [6,5,4,3,2,6,6]})

df

In [62]:
df.duplicated()
# The ".duplicated()" method returns True for every row that repeats an earlier row in all columns; by default the first occurrence is marked False.
# Here every row is marked False, because column 'A' differs in each row.

# Use the "keep" option to change which occurrence is marked:
# keep = 'last': every duplicate except the last occurrence is True
# keep = False: all duplicates are True
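
The three settings of `keep` can be compared side by side on a frame that actually contains repeated rows. The frame below is an illustrative assumption, not the dataset from the cell above.

```python
import pandas as pd

frame = pd.DataFrame({"B": [1, 2, 2, 2], "C": [1, 5, 5, 5]})

first = frame.duplicated()               # first occurrence is not flagged
last = frame.duplicated(keep="last")     # last occurrence is not flagged
every = frame.duplicated(keep=False)     # all repeated rows are flagged

print(first.tolist())   # [False, False, True, True]
print(last.tolist())    # [False, True, True, False]
print(every.tolist())   # [False, True, True, True]
```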

In [63]:
df.drop_duplicates()


What would be the output if we had applied `drop_duplicates()` only for columns 'B','C' and 'D'? 
````
%python

df.drop_duplicates(['B','C','D'])
````

#### `PLEASE DON'T CODE`


In [65]:
df.drop_duplicates(['B','C','D'])

# We can restrict the comparison to some columns by giving a sequence of labels to the "drop_duplicates" method.

# This method can also be used on Series objects


<b>Note:</b> as we can see, the last row (row 6) is removed, given that everything except column A is a duplicate of row 5. By default the first of the duplicated rows is kept; in the example above that is row 5. If we add the following parameter, the last row is kept instead. 
````
keep = 'last'
````
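
The effect of `keep` on `drop_duplicates()` can be checked on a reduced version of the dataset. This is a sketch; the three-row frame and the names `kept_first` and `kept_last` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"A": ["x", "y", "z"], "B": [1, 6, 6], "D": [9, 6, 6]})

# Rows 1 and 2 agree on B and D, so only one of them survives
kept_first = frame.drop_duplicates(subset=["B", "D"])
kept_last = frame.drop_duplicates(subset=["B", "D"], keep="last")

print(kept_first.index.tolist())   # [0, 1]
print(kept_last.index.tolist())    # [0, 2]
```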

Often, the transformation we need to perform on a DataFrame is much more complex, and it requires us to create a new function or to map values in a more robust way. Consider the following fictitious DataFrame: 
````
      Company  Quantity  randVals
0     FERRARI        40         0
1  VOLKSWAGEN        30         1
2     PORSCHE       120         1
3  Volkswagen        60         0
4        ford        70         1
5        Ford        81         1
6     Ferrari        32         1
7     Peugeot        53         0
8     Renault        62         0
````
Consider the following mapping dictionary: 
````
brand_to_country = {'Ferrari':'Italy','Volkswagen':'Germany','Porsche':'Germany','Ford':'USA','Peugeot':'France','Renault':'France'}
````
We can see that the name format of the "Company" column is not standardized: some names are all in lower case, others all in upper case. The following string methods, available on a Series through the `.str` accessor, allow us to easily convert the strings to a single format: 

````
## all lower case
str.lower()

## all capital
str.upper()

## title style (first letter capital, others lower case)
str.title()
````


Recreate the dataframe shown above from scratch and normalize the names using the string methods shown above. 


In [69]:
 

df = pd.DataFrame({'Company': ['FERRARI','VOLKSWAGEN','PORSCHE','Volkswagen',
                               'ford','Ford','Ferrari','Peugeot','Renault'],
                   'Quantity': [40, 30, 120, 60, 70, 81, 32, 53, 62],
                   'randVals' : [0, 1, 1, 0, 1, 1, 1, 0, 0]})

print('Standardizing word format: ')

df.Company = df.Company.str.upper()

print(df)

# Here the "Company" column is selected with attribute access, "df.Company", which is the same as df['Company'], so it returns a Series object.

# The ".str" accessor is available on Series (and Index) objects, not on DataFrames.


In [70]:
df = pd.DataFrame({'Company': ['FERRARI','VOLKSWAGEN','PORSCHE','Volkswagen',
                               'ford','Ford','Ferrari','Peugeot','Renault'],
                   'Quantity': [40, 30, 120, 60, 70, 81, 32, 53, 62],
                   'randVals' : [0, 1, 1, 0, 1, 1, 1, 0, 0]})

#df.Company.str.split("o")
#df['Company'].str.split("o").str.join("o")
#df['Company'].str.len()
#df['Company'].str.contains("Ferrari")
#df['Company'].str.cat([" Model"]*df['Company'].shape[0])



Building on the built-in function `map` that we learned in the Functions module, use the `map` method of a Series to map each company name to its country of origin, and add the result as a new column of the original dataframe. 

In [72]:
brand_to_country = {'Ferrari':'Italy',
                    'Volkswagen':'Germany',
                    'Porsche':'Germany',
                    'Ford':'USA',
                    'Peugeot':'France',
                    'Renault':'France'}
                    
#df['Company'].map(lambda x: brand_to_country[x.title()])

## adding as a new column
df['Country'] = df['Company'].map(lambda x: brand_to_country[x.title()])
df

# The "map" method shown here is a Series method.
# For DataFrames we can use "applymap" (renamed to "DataFrame.map" in recent pandas releases)

# We can apply a method to all columns by using "applymap" or "apply"
#df.select_dtypes("object").apply(lambda x: x.str.title())
#df.select_dtypes("object").applymap(lambda x: x.title())

# Note that "applymap" performs an element-wise transformation, i.e. it goes column by column and row by row and applies the function to each value. That is why x.title() is enough: every "x" is a single string object.

# With "apply" we need x.str.title(), because every "x" is a whole Series (one column).

# In this case both approaches solve the same problem, but sometimes one is a better fit than the other.
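
One detail worth knowing: indexing the dictionary inside a `lambda` raises a `KeyError` as soon as a brand is missing from the mapping, whereas passing the dictionary itself to `map` returns `NaN` for unknown keys. The sketch below assumes a reduced dictionary and a hypothetical brand, 'Tesla', that is absent from it.

```python
import pandas as pd

brand_to_country = {"Ferrari": "Italy", "Ford": "USA"}
companies = pd.Series(["FERRARI", "ford", "Tesla"])   # 'Tesla' is not in the mapping

# Normalize the case first, then look every name up in the dictionary
countries = companies.str.title().map(brand_to_country)

print(countries.tolist())            # ['Italy', 'USA', nan]
print(countries.isnull().tolist())   # [False, False, True]
```

The resulting `NaN` values can then be handled with the missing-data tools from the beginning of this section.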

Consider again the previous dataframe:

````
      Company  Quantity  randVals  Country
0     Ferrari        40         0    Italy
1  Volkswagen        30         1  Germany
2     Porsche       120         1  Germany
3  Volkswagen        60         0  Germany
4        Ford        70         1      USA
5        Ford        81         1      USA
6     Ferrari        32         1    Italy
7     Peugeot        53         0   France
8     Renault        62         0   France
````

Write and run the code to replace all values in the randVals column where `randVals = 0` with values of \\(2\\) in df


In [75]:
 

df.randVals.replace(0,2)

Write and run the code to replace all instances of 'Porsche' with 'Porsche Cayman', all instances of 'USA' with 'United States of America' and all instances of \\(0\\) with \\(10\\) in df.

In [77]:
 

df.replace(to_replace = ['Porsche','USA',0]
          ,value      = ['Porsche Cayman','United States of America',10])

# The "replace" method can be used on both Series and DataFrame objects.
# Instead of giving two lists (one with the values to look for and one with the replacements), we can give dictionaries keyed by column name

df.replace(to_replace = { "Company": ["Porsche"]
                         ,"Country": ["USA"]
                         ,"randVals":[0]}

            , value = { "Company": ["Porsche Cayman"]
                       ,"Country": ["United States of America"]
                       ,"randVals":[10]}
            )
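
A compact alternative is a single nested dictionary of the form `{column: {old: new}}`, which also makes it explicit that a value is only replaced in the listed column. The sketch below uses a reduced, illustrative frame; note that the 0 in `Quantity` is left alone.

```python
import pandas as pd

frame = pd.DataFrame({
    "Company": ["Porsche", "Ford"],
    "Country": ["Germany", "USA"],
    "randVals": [0, 1],
    "Quantity": [0, 5],
})

replaced = frame.replace({
    "Company": {"Porsche": "Porsche Cayman"},
    "Country": {"USA": "United States of America"},
    "randVals": {0: 10},
})

print(replaced)
```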


`pandas` has two tools to help discretize and bin variables:

#### pd.cut()
```Python
# Easily Create bins in variables using cut
pd.cut(x, bins, right, labels, include_lowest)
```

| **Parameters** | **Type** | **Description** |
| --- | --- | --- |
| x | array-like | Needs to be 1-dimensional |
| bins | list | Each element will define the bin edges |
| right | Bool | Default = ``True``, indicates if the bins will include the rightmost edge |
| labels | optional list | Specify the labels that will be returned for the bins |
| include_lowest | Bool | Default = ``False``, indicates if the first bin will include the leftmost edge |
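
How `right`, `labels` and `include_lowest` interact is easiest to see on values that sit exactly on the bin edges. The following is a sketch with illustrative scores and labels, not data from this section.

```python
import pandas as pd

scores = pd.Series([0, 49, 50, 75, 100])
edges = [0, 50, 75, 100]
names = ["low", "mid", "high"]

# Default: intervals are (0, 50], (50, 75], (75, 100], so 0 falls outside
default = pd.cut(scores, edges)
print(default.isnull().tolist())     # [True, False, False, False, False]

# include_lowest=True closes the first interval on the left as well
labeled = pd.cut(scores, edges, labels=names, include_lowest=True)
print(labeled.tolist())              # ['low', 'low', 'low', 'mid', 'high']

# right=False flips the closed side: [0, 50), [50, 75), [75, 100)
left_closed = pd.cut(scores, edges, labels=names, right=False)
print(left_closed.tolist())          # ['low', 'low', 'mid', 'high', nan]
```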

#### pd.qcut()
```Python
# Easily create bins quantile based using qcut:
pd.qcut(x, q, labels)
```

| **Parameters** | **Type** | **Description** |
| --- | --- | --- |
| x | array-like | Needs to be 1-dimensional |
| q | int or list | If ``int``, the number of quantiles used to bin the data; a ``list`` of quantiles can also be given |
| labels | optional list | Specify the labels that will be returned for the bins |



Consider the following \\(2\\) lists:

````
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
age_bins = [18, 25, 35, 60, 100]
````

Write and run the code to create a dataframe of binned ages

In [80]:
ages = [20, 22, 25, 27, 21, 23, 
        37, 31, 61, 45, 41, 32]
        
age_bins = [18, 25, 35, 60, 100]

pd.cut(x = ages
      ,bins = age_bins
      # , labels = [1,2,3,4] # uncomment to label the bins, since the generated interval names are rather long
      )

When the data are unbalanced (for example, many values below 100 and one single value of 1000), bin edges built around the median give a better-balanced binning, with more values in each interval, than bin edges built around the mean, which leave the intervals highly unbalanced. Similar to `cut`, we can use `qcut`, which defines the intervals based on quantiles instead of fixed values. Consider the random sample below: 

````
data = np.random.randn(10)
````

We can create \\(5\\) equally populated categories using qcut:
````
pd.qcut(data, 5)
````

In [82]:
data = np.random.randn(10)
print(data)
pd.qcut(data, 5)
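
Since the sample above is random, a deterministic sketch may help to compare equal-width bins (`cut`) with equally populated bins (`qcut`) on skewed data. The values are illustrative: nine small numbers and a single value of 1000.

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

equal_width = pd.cut(skewed, 2)    # two bins of equal width
equal_count = pd.qcut(skewed, 2)   # two bins split at the median

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)   # [9, 1]
print(count_counts)   # [5, 5]
```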


In Machine Learning, the creation of "dummies" is commonly used. Consider the following dataframe:
````
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
````
`pandas` provides a very useful tool: `get_dummies`:

```Python
pd.get_dummies(data, prefix, drop_first)
```

| **Parameters** | **Type** | **Description** |
|---|---|---|
| data | Series or DataFrame | The data to get dummies from |
| prefix | optional str | Add a prefix to all created dummies |
| drop_first | Bool | Default = ``False``. If ``True``, the first level created is dropped |


Write a script to generate a dataframe as the one above and create dummies based on the column <i>key</i> to transform the data. Describe what happens to the column 'key'.

In [84]:
 

## ANSWER:
# The 'key' column was one-hot encoded. Every distinct
# value in the key column became a new column and if the
# corresponding row's value of 'key' matches that column,
# then a value of 1 (True in recent pandas versions) is
# assigned to the result, otherwise it is set to 0 (False).

df = pd.DataFrame({'key':['b','b','a','c','a','b'],
                   'data1':range(6)})
df
#pd.get_dummies(df['key'])

In [85]:
# The default separator '_' is added between the prefix and each value
pd.get_dummies(df['key'], prefix='KEY')
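
The `drop_first` option from the table above can be illustrated on the same kind of data. In this sketch, `dtype=int` is passed so that the result contains 0/1 regardless of the pandas version; the names `frame`, `dummies` and `reduced` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

dummies = pd.get_dummies(frame["key"], prefix="key", dtype=int)
print(list(dummies.columns))       # ['key_a', 'key_b', 'key_c']

# drop_first=True removes the first level; it is implied when all others are 0
reduced = pd.get_dummies(frame["key"], prefix="key", drop_first=True, dtype=int)
print(list(reduced.columns))       # ['key_b', 'key_c']

# Attach the dummies to the remaining columns
encoded = frame[["data1"]].join(dummies)
print(encoded)
```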

Describe a situation where you would want to use dummies.



#### `ANSWER`
When extracting features from a dataset to use in building ML models, it is often useful to transform categorical features into numerical features which are required for most ML algorithms.

Another important technique in data wrangling involves string manipulation. Consider the following example: 
````
a = 'My name is John Doe'
a.split(' ')

>>> ['My', 'name', 'is', 'John', 'Doe']
````
Now, consider the following: 
````
dates = ['01-01-2019','01-02-2019','01-03-2019','01-04-2019']
````
Write a script to split the first element of the list above on '-' (returning a list of three elements representing the day, month, and year).



In [89]:
dates = ['01-01-2019','01-02-2019','01-03-2019','01-04-2019']
dates[0].split('-')


Using functional programming, generalize the previous expression to all elements in the list. 

In [91]:
 

list(map((lambda x: x.split('-')),dates))


Write a script to create three separate lists: one list of days, one list of months and one list of years.


In [93]:
days   = list(map((lambda x: x.split('-')[0]),dates))
months = list(map((lambda x: x.split('-')[1]),dates))
years  = list(map((lambda x: x.split('-')[2]),dates))

print(days, months, years)
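
The three lists can also be built in a single pass: splitting every date once and then transposing the result with `zip(*...)`. This is an alternative sketch, not a replacement for the `map` solution above.

```python
dates = ['01-01-2019', '01-02-2019', '01-03-2019', '01-04-2019']

# Each split gives [day, month, year]; zip(*...) regroups the parts by position
days, months, years = map(list, zip(*(d.split('-') for d in dates)))

print(days)     # ['01', '01', '01', '01']
print(months)   # ['01', '02', '03', '04']
print(years)    # ['2019', '2019', '2019', '2019']
```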

The method `join` is the opposite operation of `split`. What's the output of this code? 

````
a = 'My name is John Doe'
' '.join(a.split(' ')) == a
````


In [95]:
 




a = 'My name is John Doe'

' '.join(a.split(' ')) == a


Other useful methods include `find`, `replace` and `count`. Predict the output of the following code:
````
a = 'Hi, my name is John Doe'
a.find(',')
````

In [97]:
a = 'Hi, my name is John Doe'
a.find(',')
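
The other two methods mentioned above, `replace` and `count`, can be tried on the same string. A short sketch follows; the replacement name 'Jane' is only an illustration.

```python
a = 'Hi, my name is John Doe'

comma_pos = a.find(',')            # index of the first match
missing_pos = a.find('z')          # -1 when the substring is absent
n_o = a.count('o')                 # number of non-overlapping occurrences
renamed = a.replace('John', 'Jane')

print(comma_pos, missing_pos, n_o)   # 2 -1 2
print(renamed)                       # Hi, my name is Jane Doe
```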

In [98]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Date1":[20190101,20190102],"Date2":['02/01/2019','03/01/2019'],"Date3":['12-20-2019', '12-21-2019']})
#pd.to_datetime(df.Date1)
df
df.Date1 = pd.to_datetime(df.Date1, format = '%Y%m%d') # Try changing %m (month) to %M (minute) and compare the result
df.Date2 = pd.to_datetime(df.Date2, format = '%d/%m/%Y')  # put origin = '01/01/1900'
df.Date3 = pd.to_datetime(df.Date3, format = '%m-%d-%Y')
df
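
Real datasets often contain a few entries that do not match the expected format. A minimal sketch of one way to handle this, assuming that invalid entries should simply become missing values: with `errors='coerce'`, `pd.to_datetime` returns `NaT` instead of raising an error. The Series `raw` is illustrative.

```python
import pandas as pd

raw = pd.Series(['02/01/2019', '31/12/2019', 'not a date'])

# Day-first format; entries that cannot be parsed become NaT
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed.iloc[0])              # 2019-01-02 00:00:00
print(parsed.isnull().tolist())    # [False, False, True]
```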



In [99]:
#df.Date3.dt.dayofweek
#df.Date3.dt.dayofyear
df.Date3.dt.day_name()


In [100]:
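# Difference in days between two dates
df['nb_Days'] = df.Date3 - df.Date1

# Difference in months between two dates, counted as the number of calendar-month
# boundaries crossed (recent pandas versions no longer accept the ambiguous 'M' timedelta unit)
df['nb_months'] = (df.Date3.dt.year - df.Date1.dt.year) * 12 + (df.Date3.dt.month - df.Date1.dt.month)
df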


In [101]:
from pandas.tseries.offsets import MonthEnd, MonthBegin, BDay

# Last day of the month (a date that is already a month end moves to the next month's end)
df['Date3_end'] = df['Date3'] + MonthEnd(1)

# First day of the month (a date that is already the 1st moves to the previous month's 1st)
df['Date3_begin'] = df['Date3'] + MonthBegin(-1)

df

In [102]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
cal = calendar()
holidays = cal.holidays(start=df.Date1.min(), end=df.Date3.max())
holidays


In [103]:
from pandas.tseries.offsets import CustomBusinessDay
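# Treat only Monday, Wednesday, Friday and Saturday as working days
weekmask = 'Mon Wed Fri Sat'

# Business-day offset that also skips the US federal holidays computed above
bday_cust = CustomBusinessDay(holidays=holidays, weekmask=weekmask)

# Number of custom business days between the earliest Date1 and the latest Date3
pd.date_range(df.Date1.min(), df.Date3.max(), freq=bday_cust).size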
