# Lecture 7: Data Wrangling

* You will learn how to work with messy data: extract it, clean it, and deal with invalid or missing values.
* You will manipulate data using the pandas package.

In [1]:
import pandas as pd
import numpy as np

## 1. Missing data

* Missing data is common in most data analysis applications.

In [2]:
fruits = pd.Series(['apple', 'banana', np.nan, 'orange'])

In [4]:
fruits[fruits.isna()]

2    NaN
dtype: object

### Filtering out missing data

In [5]:
data = pd.Series([1, np.nan, 3, np.nan, 7])

In [8]:
data = data.fillna(99)

In [9]:
data

0     1.0
1    99.0
2     3.0
3    99.0
4     7.0
dtype: float64

* **dropna** returns the Series with only the non-null data and index values.

In [10]:
data = pd.Series([1, np.nan, 3, np.nan, 7])
data.dropna(inplace=True)

In [11]:
data

0    1.0
2    3.0
4    7.0
dtype: float64

* With DataFrame objects, you may want to drop rows or columns which are all NA or just those containing any NAs.
* **dropna** by default drops any row containing a missing value.

In [12]:
data = pd.DataFrame([[1,6,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,5.5,3.0]])

In [14]:
data.dropna()

     0    1    2
0  1.0  6.0  3.0


* Passing **how='all'** will only drop rows that are all NA.

In [15]:
# Drop only the rows in which every value is NA
data.dropna(how='all')

     0    1    2
0  1.0  6.0  3.0
1  1.0  NaN  NaN
3  NaN  5.5  3.0


* Dropping columns in the same way is only a matter of passing **axis=1**.

In [16]:
data['4']=np.nan

In [23]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.0  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  5.5  3.0


* You can keep only the rows that contain at least a given number of non-NA observations with the **thresh** argument.

In [25]:
# Keep only the rows with at least 3 non-NA values
data.dropna(thresh=3)

     0    1    2   4
0  1.0  6.0  3.0 NaN


### Filling in missing data

* Rather than filtering out missing data, you may want to fill in the “holes” in any number of ways.
* Calling **fillna** with a constant replaces missing values with that value.

* Calling **fillna** with a dict lets you use a different fill value for each column.

## 2. Merging data

* Merge or join operations combine data sets by linking rows using one or more keys.

<img src="https://i.stack.imgur.com/YvuOa.png" style="height:300px" align="left">

In [None]:
df1 = pd.DataFrame({'name': ['b', 'a', 'c', 'd'], 'hw1': range(4)})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'hw2': range(3)})

* By default, **merge** does an **inner** join: only keys present in both tables are kept, so 'c', which appears only in df1, is missing from the result of `pd.merge(df1, df2)`.
* Other possible options are **left**, **right**, and **outer**.
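
The join types listed above can be compared on the two small tables from this section. This is a sketch; the result names `inner`, `left`, `right`, and `outer` are chosen for illustration, and `on='name'` is spelled out even though pandas would pick the shared column automatically.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['b', 'a', 'c', 'd'], 'hw1': range(4)})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'hw2': range(3)})

# Inner join (the default): only names found in both tables
inner = pd.merge(df1, df2, on='name')

# Left join: every row of df1; hw2 is NaN where df2 has no match ('c')
left = pd.merge(df1, df2, on='name', how='left')

# Right join: every row of df2
right = pd.merge(df1, df2, on='name', how='right')

# Outer join: the union of the names from both tables
outer = pd.merge(df1, df2, on='name', how='outer')

print(inner)
print(left)
```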

<img src="https://i.stack.imgur.com/BECid.png" style="height:300px" align="left">

<img src="https://i.stack.imgur.com/8w1US.png" style="height:300px" align="left">

<img src="https://i.stack.imgur.com/euLoe.png" style="height:300px" align="left">

### Concatenating along an axis

* Calling **concat** with a list of objects glues together their values and indexes:

In [None]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
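
A minimal sketch of `concat` on the two DataFrames above. The result names are chosen for illustration, and `ignore_index` and `axis=1` are shown as two commonly used options rather than as part of the original lecture.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

# Stack the rows; columns are aligned by name, so 'c' is NaN for df2's rows
stacked = pd.concat([df1, df2])

# The original row labels are kept (0, 1, 2, 0, 1) unless you ask for new ones
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
```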

* For more information, see https://pandas.pydata.org/pandas-docs/version/1.0.3/user_guide/merging.html

## 3. Data Transformation
### Removing duplicates

* Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [None]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

* The **duplicated** method returns a boolean Series indicating whether each row is a duplicate of an earlier row:

* **drop_duplicates** returns a DataFrame containing only the rows for which the duplicated array is False:
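
Both methods can be tried on the DataFrame defined above. This is a sketch; the names `is_dup`, `deduped`, and `by_k1` are chosen for illustration, and the `subset` argument is an extra option not covered in the text.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# True for every row that repeats an earlier row in all columns
is_dup = data.duplicated()

# Keeps the first occurrence of each distinct row
deduped = data.drop_duplicates()

# Only compare the 'k1' column when deciding what counts as a duplicate
by_k1 = data.drop_duplicates(subset=['k1'])

print(is_dup)
print(deduped)
```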

### Transforming data using a function or mapping

* For many data sets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

In [None]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami', 'corned beef', 'bacon', 'pastrami', 
                              'honey ham', 'nova lox'], 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

* Suppose you wanted to add a column indicating the type of animal that each food came from.

In [None]:
meat_to_animal = { 'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow', 
                  'honey ham': 'pig', 'nova lox': 'salmon'}

* Using **map** is a convenient way to perform element-wise transformations and other data cleaning-related operations.
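
One way to build the animal column is to pass the dict to `map`, as sketched below. The column name `animal` and the function-based variant `animal_fn` are assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami',
                              'corned beef', 'bacon', 'pastrami',
                              'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
                  'corned beef': 'cow', 'honey ham': 'pig',
                  'nova lox': 'salmon'}

# map looks up each food in the dict, element by element
data['animal'] = data['food'].map(meat_to_animal)

# map also accepts a function; this gives the same result
animal_fn = data['food'].map(lambda food: meat_to_animal[food])

print(data)
```

With the dict form, a food that is missing from `meat_to_animal` becomes NaN, whereas the lambda above would raise a `KeyError`.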

### Replacing values

* While **map** can be used to modify a subset of values, **replace** provides a simpler and more flexible way to do so.

In [None]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

* If you want to replace multiple values at once, pass a list of the values to replace, followed by the substitute value:

* The argument passed can also be a dictionary:
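
The three calling styles can be sketched on the `k2` column of the DataFrame above. The replacement values (100, 400) and the result names are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# One value replaced by another
single = data['k2'].replace(1, 100)

# A list of values, all replaced by the same substitute
several = data['k2'].replace([1, 2], 100)

# A dict maps each old value to its own substitute
mapped = data['k2'].replace({1: 100, 4: 400})

print(mapped)
```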

### Renaming columns

In [None]:
df1 = pd.DataFrame({'name': ['b', 'a', 'c', 'd'], 'hw1': range(4)})
df2 = pd.DataFrame({'NAME': ['a', 'b', 'd'], 'hw2': range(3)})
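
The two tables above name their key column differently ('name' and 'NAME'), so one possible fix is to rename the column before merging. This is a sketch; `df2_renamed`, `lowercased`, and `merged` are names chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['b', 'a', 'c', 'd'], 'hw1': range(4)})
df2 = pd.DataFrame({'NAME': ['a', 'b', 'd'], 'hw2': range(3)})

# A dict maps old column names to new ones; df2 itself is not modified
df2_renamed = df2.rename(columns={'NAME': 'name'})

# A function is applied to every column name instead
lowercased = df2.rename(columns=str.lower)

# With matching key names, the tables merge on 'name'
merged = pd.merge(df1, df2_renamed, on='name')

print(merged)
```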

### Sort values

* A pandas DataFrame has two useful sorting methods:

    * **sort_values()**: to sort pandas data frame by one or more columns
    * **sort_index()**: to sort pandas data frame by row index

* Each of these methods comes with numerous options, such as sorting in a specific order (ascending or descending), sorting in place, controlling where missing values are placed, and choosing the sorting algorithm.
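
A small sketch of the difference between the two methods, including where missing values end up. The DataFrame `scores` and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows labelled 'c', 'a', 'b', one score missing
scores = pd.DataFrame({'score': [2.0, np.nan, 1.0]}, index=['c', 'a', 'b'])

# sort_index orders the rows by their labels
by_index = scores.sort_index()

# sort_values orders the rows by a column; NaN is placed last by default
by_value = scores.sort_values('score')

# na_position='first' moves the missing values to the top instead
nan_first = scores.sort_values('score', na_position='first')

print(by_index)
print(by_value)
```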

In [None]:
data_BM = pd.read_csv('bigmart_data.csv')

######## For Colab users ########
#import io
#from google.colab import files
#uploaded = files.upload()
#data_BM = pd.read_csv(io.StringIO(uploaded['bigmart_data.csv'].decode('utf-8')))

* Suppose you want to sort the DataFrame by "Outlet_Establishment_Year". For this you use **sort_values**.

- `sort_values` takes several options, such as:
    - `ascending`: sorting is ascending by default; pass `False` to sort in descending order.
    - `inplace`: whether to sort the DataFrame itself instead of returning a sorted copy.
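
The options above can be sketched as follows. Because `bigmart_data.csv` is not available here, the example uses `sample_BM`, a hypothetical stand-in with made-up rows; only the column names come from this lecture.

```python
import pandas as pd

# Made-up rows standing in for bigmart_data.csv (assumption)
sample_BM = pd.DataFrame({
    'Item_Type': ['Dairy', 'Meat', 'Dairy', 'Snack Foods'],
    'Outlet_Establishment_Year': [1999, 2009, 1987, 1999],
    'Item_Outlet_Sales': [3735.1, 443.4, 2097.3, 994.7],
})

# Ascending order is the default
oldest_first = sample_BM.sort_values(by='Outlet_Establishment_Year')

# ascending=False reverses the order
newest_first = sample_BM.sort_values(by='Outlet_Establishment_Year',
                                     ascending=False)

# inplace=True reorders sample_BM itself and returns None
sample_BM.sort_values(by='Outlet_Establishment_Year', inplace=True)

print(newest_first)
```

The same calls should work on `data_BM` once the CSV file is loaded.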

* To sort a DataFrame by the values of multiple columns, pass the column names as a list to sort_values().
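
A sketch of sorting by two columns. The rows of `sample_BM` are made up, and the choice of columns and of a per-column `ascending` list is an assumption made for illustration.

```python
import pandas as pd

# Made-up rows standing in for bigmart_data.csv (assumption)
sample_BM = pd.DataFrame({
    'Outlet_Establishment_Year': [1999, 2009, 1987, 1999],
    'Item_Outlet_Sales': [3735.1, 443.4, 2097.3, 994.7],
})

# Sort by year first (oldest first); within the same year,
# put the highest sales first
sorted_multi = sample_BM.sort_values(
    by=['Outlet_Establishment_Year', 'Item_Outlet_Sales'],
    ascending=[True, False],
)

print(sorted_multi)
```

The second column only matters for breaking ties in the first one, which is why the two rows from 1999 are the only ones it reorders.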


## 4. Aggregating

- In the given data set, you may want to find the **mean price for each item type**.
- You can use **groupby()** to achieve this.
- The first step would be to group the data by Item_Type column.
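
The steps above can be sketched as follows, assuming that the price is stored in the `Item_MRP` column. The rows of `sample_BM` are made up so that the example runs without the CSV file.

```python
import pandas as pd

# Made-up rows standing in for bigmart_data.csv (assumption)
sample_BM = pd.DataFrame({
    'Item_Type': ['Dairy', 'Meat', 'Dairy', 'Meat', 'Snack Foods'],
    'Item_MRP': [100.0, 50.0, 200.0, 150.0, 80.0],
})

# Step 1: group the rows by item type (nothing is computed yet)
grouped = sample_BM.groupby('Item_Type')

# Step 2: pick the price column and take the mean within each group
mean_price = grouped['Item_MRP'].mean()

print(mean_price)
```

The result is a Series indexed by `Item_Type`, with one mean price per group.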

* You can use `groupby` with **multiple** columns of the dataset too. 
* In this case, if you want to group first based on the Item_Type and then Item_MRP you can simply pass a list of column names.
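
A sketch of grouping by a list of columns. The rows are made up, and averaging `Item_Outlet_Sales` within each group is an assumption chosen to give the groups something to aggregate.

```python
import pandas as pd

# Made-up rows standing in for bigmart_data.csv (assumption)
sample_BM = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Meat'],
    'Item_MRP': [100.0, 100.0, 200.0, 50.0],
    'Item_Outlet_Sales': [10.0, 30.0, 50.0, 70.0],
})

# Group by item type first, then by price within each item type
grouped = sample_BM.groupby(['Item_Type', 'Item_MRP'])

# One mean per (Item_Type, Item_MRP) pair
mean_sales = grouped['Item_Outlet_Sales'].mean()

print(mean_sales)
```

The result has a two-level index, so a single value is looked up with a tuple such as `mean_sales.loc[('Dairy', 100.0)]`.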

## Exercise

1. How many item types ('Item_Type') are there?

In [None]:
# your code here



2. How many low-fat items are there in the item type 'Meat'?

In [None]:
# your code here



3. Which item type ('Item_Type') has the highest sales ('Item_Outlet_Sales') on average?

In [None]:
# your code here



## References
* Chen, D. Y. (2017). Pandas for everyone: Python data analysis. Addison-Wesley Professional.
* Data Analysis with Python: https://www.coursera.org/learn/data-analysis-with-python
* https://pandas.pydata.org/