# Data Wrangling

Data wrangling is often estimated to take up roughly \\(80 \% \\) of a data scientist's time, and it is one of the most difficult steps in the model development process. There are no fixed recipes for data wrangling, but some best practices help the analyst make the process more efficient. In most cases, the data to be analyzed does not come from a single source, nor is it contained in a single file; it is very often spread across a variety of files and databases. The `pandas` library provides many powerful tools to join and concatenate datasets, to reshape them and to rearrange the data. Some of these features include:

1. Hierarchical Indexing
2. Combining and Merging Datasets
3. Reshaping and Pivoting

#### What is data wrangling? 

Data wrangling consists of structuring, aggregating and cleaning data (some definitions also include data enrichment, i.e. feature generation). Common activities covered by the term include, but are not limited to: gathering data from different sources; understanding the sources and variables of different datasets; removing duplicates, filling in missing values and reducing errors in the data; joining datasets into a single table; generating new variables (data enrichment); and visualizing the data in order to spot and remove outliers or other undesirable values.

In this module we explore the process of wrangling data using `pandas` and other Python functions.

<b>Note (1):</b> This section is intended to be more "hands-on" so that the user can become more familiar with these techniques.
<b>Note (2):</b> Refer to the `pandas` module to remember some of the syntax.


Two very useful methods to manipulate and transform a dataframe are `stack()` and `unstack()`. For example, on a dataframe with a two-level row index such as the one built below, the `unstack()` method moves the inner index level into the columns. See the example below: 


In [2]:
 
import numpy as np
import pandas as pd
df = pd.DataFrame(np.around(np.random.randn(10,5), decimals = 2),
                  index=[['a','a','a','b','b','b','c','c','d','d'],
                         [ 1 , 2 , 3 , 1 , 2 , 3 , 1 , 2 , 1 , 2]])

df.unstack()
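
The effect is easier to follow on fixed numbers than on random ones. Below is a minimal sketch on a tiny hand-made frame; the name `toy` and its values are illustrative assumptions, not part of the original example. The inner index level becomes a column level, and combinations that do not exist in the data are filled with `NaN`.

```python
import pandas as pd

# Hypothetical toy frame: outer level 'a'/'b', inner level 1/2; ('b', 2) is absent.
toy = pd.DataFrame(
    {"x": [10, 20, 30]},
    index=[["a", "a", "b"], [1, 2, 1]],
)

wide = toy.unstack()  # the inner index level (1, 2) moves into the columns

print(wide)
print(wide.shape)  # one row per outer label, one column per (column, inner label) pair
```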

`stack()` is the inverse operation of `unstack()`. What's the output of the following operation?
````
df.unstack().stack()
````
#### `PLEASE DO NOT CODE`


#### `ANSWER`

The original dataframe does not change.

Pandas allows us to reorder a dataframe based on its index. In a multi-index dataframe such as the one shown previously, we refer to the first (outer) index with `level = 0` (values `'a'`, `'b'`, `'c'`, `'d'`) and to the second (inner) index with `level = 1` (values `1`, `2`, `3`).
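
As a quick way to see what each level contains, the sketch below reads the labels of both levels from a small illustrative frame (`toy` is a hypothetical name with made-up values).

```python
import pandas as pd

# Hypothetical toy frame with a two-level row index (outer letters, inner numbers).
toy = pd.DataFrame(
    {"x": [10, 20, 30, 40]},
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

outer_labels = toy.index.get_level_values(0)  # one label per row, taken from level 0
inner_labels = toy.index.get_level_values(1)  # one label per row, taken from level 1

print(toy.index.nlevels)   # number of index levels
print(list(outer_labels))
print(list(inner_labels))
```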

We can change the arrangement of the data within the dataframe by sorting the index: 
````
df.sort_index(level=1)
````
We can also swap the levels and change the index order:
````
df.swaplevel(0,1)
````


In [6]:
 

df.sort_index(level=1)

# `level` tells pandas which index level to sort by
# In this case, it sorts by the second level (the index at position 1)


In [7]:
df.swaplevel(0,1)

# this method takes the two index levels to be swapped

What's the expected output? 
````
df.swaplevel(0,1).sort_index(level=0)
````

In [9]:
 

df.swaplevel(0,1).sort_index(level=0)

# The first method swaps levels 0 and 1
# After this, the former level 1 is now level 0
# That's why the result is sorted by the numeric level

What's the expected output? 
````
df.sort_index(level = 1)
````

In [11]:
df.sort_index(level = 1)

# The result is the same as in Exercise 2, because the code in Exercise 2 did not modify the DataFrame in place. 


What's the output? 
````
np.all(df.swaplevel(0,1).sort_index(level = 0).values == df.sort_index(level = 1).values)
````
#### `PLEASE DO NOT CODE`

In [13]:
# The first step compares the two arrays element by element. Each comparison returns True if the values are equal (and False otherwise).

# The second step is np.all(). It returns True only if every comparison is True. If a single comparison is False, the final result is False.

# So, the output will be "True"


In [14]:
df.swaplevel(0,1).sort_index(level = 0).values == df.sort_index(level = 1).values
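
The same check can be reproduced on fixed numbers. The sketch below uses a hypothetical frame `toy` and relies on the default `sort_remaining=True` of `sort_index`, which sorts by the remaining levels after the requested one. It confirms that the values line up while the indexes differ.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values so the comparison is reproducible.
toy = pd.DataFrame(
    {"x": [0, 1, 2, 3, 4, 5]},
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

swapped_sorted = toy.swaplevel(0, 1).sort_index(level=0)  # numbers become the outer level
sorted_inner = toy.sort_index(level=1)                    # letters stay in the outer level

same_values = np.array_equal(swapped_sorted.values, sorted_inner.values)
same_index = swapped_sorted.index.equals(sorted_inner.index)

print(same_values)  # whether the data rows come out in the same order
print(same_index)   # whether the index tuples are arranged the same way
```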


#### `ANSWER`
The values of the two dataframes are indeed the same: swapping the levels and sorting by level 0 is equivalent to sorting directly by level 1. Only the index differs, because in the first case the levels are swapped and in the second they keep their original positions.

We can also summarize results by applying calculations to different levels. For instance, the following calculation sums all the values that share the same label in the specified index level (older `pandas` versions also accepted the shorthand `df.sum(level=0)`, which was removed in `pandas` 2.0): 
````
df.groupby(level=0).sum()
````
The output is shown below.


In [16]:
df.groupby(level=0).sum()

# This groups the rows by index level 0 and sums each group
# (older pandas versions, before 2.0, also accepted df.sum(level=0))
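
Grouping by an index level is one way to write this per-level sum on current `pandas` versions. The sketch below uses a hypothetical frame `toy` with fixed values so the totals can be checked by hand.

```python
import pandas as pd

# Hypothetical toy frame: outer level 'a'/'b', inner level 1/2.
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

by_outer = toy.groupby(level=0).sum()  # one row per outer label ('a', 'b')
by_inner = toy.groupby(level=1).sum()  # one row per inner label (1, 2)

print(by_outer)
print(by_inner)
```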

Similarly, one could apply the above expression at `level = 1`. Can you guess how many rows the output of the expression below will have? 
````
df.groupby(level = 1).sum()
````


In [18]:
 

df.groupby(level=1).sum()

# Since index level 1 has 3 unique values, the final result is a dataframe with 3 rows.

#### `ANSWER`

The final dataframe is a summary over the inner indices \\((1,2,3)\\). When we aggregate at `level = 1` we collapse the outer indices (a,b,c,d) and sum all the rows that share the same inner index. The `sum()` method also accepts `axis = 0` or `axis = 1` to choose the direction of the summation. Let's see what happens with the expression below.
````
df.sum(axis = 0)
````


In [20]:
 

df.sum(axis=0)

# With axis = 0, the sum runs down the rows, giving one total per column

# With axis = 1, the sum runs across the columns, giving one total per row

#df.sum(axis = 1)

Based on the previous output, can you guess what happened? 



#### `ANSWER`
The summation happened for each column (0,1,2,3,4) and the displayed results are the sum of every single value in each column. 


Based on the answer of exercise 7, what is the output of the following? 
````
df.sum(axis = 1)
````


In [24]:
 

df.sum(axis=1)

# It's doing a sum by row.
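
To keep the two directions apart, here is a small sketch with a hypothetical frame `toy` whose totals are easy to verify by hand.

```python
import pandas as pd

# Hypothetical frame with three rows and two columns.
toy = pd.DataFrame({"p": [1, 2, 3], "q": [10, 20, 30]})

column_totals = toy.sum(axis=0)  # collapses the rows: one total per column
row_totals = toy.sum(axis=1)     # collapses the columns: one total per row

print(column_totals)
print(row_totals)
```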


In [25]:
df

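The result of the previous script is the sum of each row of the dataframe. We can explore this idea a bit further by grouping on an index level while summing along `axis = 0`. What's the output of the following expression? A number or a dataframe? (In `pandas` versions before 2.0, the same operation could be written `df.sum(level=0, axis=0)`.)

````
df.groupby(level=0).sum()
````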


In [27]:
 

df.groupby(level = 0).sum()
# The sum runs along axis = 0 (down the rows) within each label of level 0,
# which older pandas versions expressed as df.sum(level = 0, axis = 0)

#### `ANSWER` 

The result is a dataframe with one row per label of `level = 0`: the rows that share an outer label are summed column by column, so the inner level is collapsed while the columns remain. 

Another way to reshape a dataframe is to use its own columns as indices. For example, we could use the columns 0 and 1 as indices as follows: 
````
df.set_index([0,1])
````
By default, `set_index()` replaces the existing indices, and the columns that become the new index are removed from the data. We can keep those columns by adding the argument `drop = False`.
````
df.set_index([0,1],drop=False)
````
Another commonly used method is `reset_index()`, which does the opposite of `set_index()`: it moves index levels back into columns.

In [30]:
df

In [31]:
df.reset_index().set_index([0,1])

In [32]:
df.set_index([0,1],drop=False)
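
The difference between the two calls is easier to see with named columns. The following sketch uses a hypothetical frame `toy` (its column names and values are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical frame with two key columns and one data column.
toy = pd.DataFrame({"city": ["Lima", "Oslo"],
                    "year": [2020, 2021],
                    "value": [1.5, 2.5]})

indexed = toy.set_index(["city", "year"])           # key columns move into the index
kept = toy.set_index(["city", "year"], drop=False)  # key columns stay in the data as well
restored = indexed.reset_index()                    # index levels become columns again

print(list(indexed.columns))
print(list(kept.columns))
print(list(restored.columns))
```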

Can you guess what is the outcome of the following expression (remember that our dataframe is still the same we started with)?
````
df.reset_index(level = 1)
````

In [34]:
df.reset_index(level = 1)
# the reset_index method moves the chosen index level into a column.
# If the dataframe has no other index level left, pandas creates a new one in the following format:
# RangeIndex(start = 0, stop = n, step = 1), where n = # rows.


What if we use the following expression instead? 
````
df.reset_index(level = 0)
````


In [36]:
 

df.reset_index(level = 0)


If we don't declare the level we want to reset, what is the output? 
````
df.reset_index()
````

In [38]:
 

df.reset_index()

# If you don't tell pandas which index you want to reset, it will reset all of them
# You can have the same result by doing df.reset_index(level = [0,1])
# Pandas will automatically create a new index
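
The sketch below contrasts resetting one level with resetting all of them on a hypothetical frame `toy`. Because the index levels are unnamed, `pandas` labels the new columns `level_0` and `level_1`.

```python
import pandas as pd

# Hypothetical toy frame with an unnamed two-level row index.
toy = pd.DataFrame(
    {"x": [10, 20, 30, 40]},
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

inner_out = toy.reset_index(level=1)  # only the inner level becomes a column
all_out = toy.reset_index()           # both levels become columns, new default index

print(inner_out)
print(all_out)
```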

Observe that calling `reset_index()` without arguments moves all the index levels into the dataframe as columns, so we can now work with the original indices as columns instead of rows. The method also accepts `inplace = True`, which changes the dataframe permanently instead of returning a new one. Knowing that `reset_index()` moves the indices into the columns, what number is printed by the following code? 
````
df.reset_index().loc[2,0]
````


In [40]:
 

df.reset_index().loc[2,0]


For this section, let's consider the two following dataframes: 
````
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
````

In [42]:
 

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})


In [43]:
df1

In [44]:
df2








Knowing that `merge()` joins two distinct dataframes, what do you think is the output of the following expression? 

````
import pandas as pd

pd.merge(df1,df2)
````

In [46]:
pd.merge(df1,df2)

What happened in the script above?



#### `ANSWER`

Because we didn't specify which column to merge on, `pandas` does the merging using the overlapping column names as keys. We could, instead, use the following script to do the same work explicitly. 
````
pd.merge(df1, df2, on='key')
````
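
A minimal sketch of this equivalence, using the data from the text under the hypothetical names `left_demo` and `right_demo` so that the notebook variables are left untouched:

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                          "data1": range(7)})
right_demo = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# Column names present in both frames; these are used when `on` is omitted.
shared = left_demo.columns.intersection(right_demo.columns)

implicit = pd.merge(left_demo, right_demo)            # keys inferred from shared names
explicit = pd.merge(left_demo, right_demo, on="key")  # keys stated explicitly

print(list(shared))
print(implicit.equals(explicit))
print(len(implicit))  # rows with key 'c' or 'd' have no partner in an inner join
```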


If the key columns have different names in dataframes \\(1\\) and \\(2\\), we can use `left_on` and `right_on` instead of simply `on` to specify the column to merge on in each dataframe. Write the expression that combines df1 and df2 if, instead of 'key', their key columns were named 'key1' and 'key2', respectively. 






In [50]:
df3 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'key2': ['a', 'b', 'd'],'data2': range(3)})

import pandas as pd
pd.merge(df3,df4,left_on = 'key1', right_on='key2')


By default, `merge()` performs an <i>inner join</i>. If we want another type of join, we specify it with the `how` argument. Can you guess the `pandas` expression for a left join between df1 and df2? What about an outer join? 




In [52]:
import pandas as pd
pd.merge(df1,df2,on='key',how='left')
pd.merge(df1,df2,on='key',how='outer')
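
To see where each row of the result comes from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below is an illustration on copies of the data from the text; `left_demo` and `right_demo` are hypothetical names.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                          "data1": range(7)})
right_demo = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# Left join: every row of left_demo is kept, matched or not.
left_join = pd.merge(left_demo, right_demo, on="key", how="left", indicator=True)

# Outer join: unmatched rows from both sides are kept.
outer_join = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)

print(left_join["_merge"].value_counts())
print(outer_join["_merge"].value_counts())
```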


Consider the following dataframes: 
````
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],'key2': ['one', 'two', 'one'],'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],'key2': ['one', 'one', 'one', 'two'],'rval': [4, 5, 6, 7]})
````
If we want to do an <i>outer</i> join between two datasets, we specify `how = 'outer'` in the call to `merge()`. Also, if we want to join on two keys at once, we can pass them as a list `(['key1','key2'])`. Knowing this, what would be the command to do a left join using both keys on the datasets we've just defined (left, right)? 



In [54]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
                     
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
                      
pd.merge(left,right,how='left',on=['key1','key2'])
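
One detail worth checking in this result is the number of rows: a left join can return more rows than the left dataframe when a key combination occurs several times on the right. The sketch below repeats the join on copies of the data (`left_demo`, `right_demo` and `joined` are hypothetical names).

```python
import pandas as pd

left_demo = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                          "key2": ["one", "two", "one"],
                          "lval": [1, 2, 3]})
right_demo = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                           "key2": ["one", "one", "one", "two"],
                           "rval": [4, 5, 6, 7]})

joined = pd.merge(left_demo, right_demo, how="left", on=["key1", "key2"])

# ("foo", "one") matches two rows on the right, so it appears twice;
# ("foo", "two") has no match, so its rval is NaN.
print(joined)
print(len(left_demo), len(joined))
```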

In [55]:
left1  = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])



In [56]:
left1  = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})
left1



In [57]:
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

right1







 

We can also `merge()` by the index instead of using columns. Suppose we have the following datasets: 

````
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})

right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b']) 
````

If we apply the following command, the resulting dataframe will have a missing value. Why? 

````
pd.merge(left1, right1, left_on = 'key', right_index = True, how = 'outer')
````


In [59]:
#Answer: 
# In the example above, we join the column 'key' of left1 
# (unique values 'a', 'b' and 'c') with the index of right1 
# (unique values 'a' and 'b'). The key 'c' has no match in right1, 
# so its row has a missing value (NaN) in 'group_val'. 

left1  = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

pd.merge(left1, right1, left_on = 'key', right_index = True, how = 'outer')
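
A short sketch of how the unmatched row could be isolated after such a merge. The names `left_demo`, `right_demo`, `merged` and `unmatched` are hypothetical; the data is copied from the text.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                          "value": range(6)})
right_demo = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

# Match the 'key' column on the left against the index on the right.
merged = pd.merge(left_demo, right_demo,
                  left_on="key", right_index=True, how="outer")

# Rows whose key was not found in the index of right_demo.
unmatched = merged[merged["group_val"].isna()]

print(merged)
print(unmatched)
```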


 

It is also possible to merge using the indexes on both sides. Suppose we have the following dataframes: 
````
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
left2  = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
````
Based on your previous knowledge, what would be the command to <i>outer</i> join 'left2' and 'right2' by their indexes? 

In [61]:
 

right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], 
                        index=['b', 'c', 'd', 'e'], 
                        columns=['Missouri', 'Alabama'])
                        
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], 
                        index=['a', 'c', 'e'], 
                        columns=['Ohio', 'Nevada'])
                        
pd.merge(left2, right2, how = 'outer', left_index = True, right_index = True)


Another option is the `join` method, which merges two dataframes by their indexes. Its basic syntax is: 
````
dataset_1.join(dataset_2, how='method')
````
By default, `join` performs a <i>left join</i>. What, then, is the correct syntax to perform a left join between the left2 and right2 dataframes? Which rows of right2 would disappear?


In [63]:
 

# Answer: 
# The rows with indexes 'b' and 'd'.
left2.join(right2)
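
The relation between `join` and an index-based `merge` can be checked directly. The sketch below copies the two frames from the text under the hypothetical names `left_demo` and `right_demo`.

```python
import pandas as pd

left_demo = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                         index=["a", "c", "e"], columns=["Ohio", "Nevada"])
right_demo = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                          index=["b", "c", "d", "e"],
                          columns=["Missouri", "Alabama"])

left_joined = left_demo.join(right_demo)           # default: left join on the index
via_join = left_demo.join(right_demo, how="outer")
via_merge = pd.merge(left_demo, right_demo, how="outer",
                     left_index=True, right_index=True)

print(list(left_joined.index))  # labels found only in right_demo are dropped
# Sorting both results by index makes the comparison independent of row order.
print(via_join.sort_index().equals(via_merge.sort_index()))
```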





Instead of simply using `merge` or `join` to join two or more dataframes, we could also concatenate them along an axis. For example, we could concatenate the following two Series into one Series following this general procedure: 

````
import pandas as pd

Series1 = pd.Series([10,20],index=['x0','x1'])
Series2 = pd.Series([30,40],index=['x2','x3'])

result = pd.concat([Series1,Series2])

>>>
x0    10
x1    20
x2    30
x3    40
dtype: int64
````

In [65]:
left2

In [66]:
right2

Concatenate the previous dataframes left2 and right2 into a single dataframe in which each index label appears only once, and show the script you used.


In [68]:
pd.concat( [left2,right2] ,axis=1)



What happens if we change the `axis` information in the `concat` method? 


In [70]:
# Answer: We get a piled (vertically stacked) dataframe. Remember that 
# axis = 0 does not need to be given explicitly, as it is the default. 

pd.concat([left2,right2],axis=0) 


Use the `join = 'inner'` argument to modify the command of Exercise 23. What's the final result? Why did this happen? 


In [72]:
 

# Answer: it produces a dataframe with only two rows and NO missing values. 
# This happens because an 'inner' join keeps only the index labels that 
# appear in both dataframes.

pd.concat([left2,right2],axis=1,join='inner') 
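
Comparing the shapes of the three results summarizes what `axis` and `join` do. This is a sketch on copies of the frames from the text; `left_demo` and `right_demo` are hypothetical names.

```python
import pandas as pd

left_demo = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                         index=["a", "c", "e"], columns=["Ohio", "Nevada"])
right_demo = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                          index=["b", "c", "d", "e"],
                          columns=["Missouri", "Alabama"])

piled = pd.concat([left_demo, right_demo], axis=0)   # rows appended, columns unioned
side_by_side = pd.concat([left_demo, right_demo], axis=1)  # index labels unioned
common_only = pd.concat([left_demo, right_demo], axis=1, join="inner")  # shared labels

print(piled.shape, side_by_side.shape, common_only.shape)
```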




We now return to the first methods we saw: `stack` and `unstack`. In short, `stack` rotates the columns of the data into rows, and `unstack` does the opposite. Let's consider the following dataset:

In [74]:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(6).reshape((2, 3)),
                  index=pd.Index(['Ohio', 'Colorado'], name='state'),
                  columns=pd.Index(['one', 'two', 'three'],name='number'))

df



What is the result of the `stack` operation in this dataset? 


In [76]:
 

# Answer: It produces a pandas Series (the columns are rotated into the rows)
df = df.stack()
df
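
Since this dataset has a value for every state and number, the round trip is easy to verify. The sketch below rebuilds the frame under the hypothetical name `states`, because `df` now holds the stacked Series.

```python
import numpy as np
import pandas as pd

states = pd.DataFrame(np.arange(6).reshape((2, 3)),
                      index=pd.Index(["Ohio", "Colorado"], name="state"),
                      columns=pd.Index(["one", "two", "three"], name="number"))

long_form = states.stack()        # Series indexed by (state, number)
wide_again = long_form.unstack()  # the inner level 'number' returns to the columns

print(long_form)
print(wide_again)
```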




Two very important methods in `pandas` are `pivot` and `melt`. Like `stack` and `unstack`, they are inverse operations on a dataframe. Let's consider the following dataframe: 

In [78]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df



The expression `df.pivot(index='key', columns='A')` (written `df.pivot('key','A')` in older `pandas` versions, which accepted positional arguments) yields: 
````
       B              C          
A      1    2    3    1    2    3
key                              
bar  NaN  5.0  NaN  NaN  8.0  NaN
baz  NaN  NaN  6.0  NaN  NaN  9.0
foo  4.0  NaN  NaN  7.0  NaN  NaN
````

What's the command to apply `pivot` to our previous dataframe such that we have 'B' values as rows and 'C' values as columns? 


In [80]:
 

df.pivot(index='B', columns='C')

# `index` sets the row index values (in this case, all the values from column B)
# `columns` sets the column names (in this case, all the values from column C)
# The "key" and "A" columns are moved to level 0 of the column index
# (pandas 2.0 and later require these arguments to be passed by keyword)



Using `pivot`, a great amount of analysis can be performed (we will later explore these methods using heatmaps). A third parameter can be added to the `pivot` method, in order to explicitly define which values we want to observe in our "pivoted" table. The general syntax is: 

````
pivot(index=None, columns=None, values=None) 
````



What is the output of the following code? 
````
df.pivot(index='A', columns='B', values='C')
````


In [83]:
 

## The diagonal values are the 'C' values corresponding to the
# intersection between 'A' and 'B' columns

df.pivot(index='A', columns='B', values='C')

# `index` gives the row index values
# `columns` gives the column names
# `values` gives the values that fill the table
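
The three roles can be seen on another combination of columns. The sketch below copies the dataframe from the text under the hypothetical name `toy` and passes the arguments by keyword, which `pandas` 2.0 and later require.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["foo", "bar", "baz"],
                    "A": [1, 2, 3],
                    "B": [4, 5, 6],
                    "C": [7, 8, 9]})

# Rows come from 'key', columns from 'A', cell values from 'B'.
wide = toy.pivot(index="key", columns="A", values="B")

print(wide)
# Each key occurs with a single value of 'A', so all other cells are NaN.
print(wide.isna().sum().sum())
```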



What about this?
````
df.pivot(index='key', columns='B', values='key')
````

In [85]:
df.pivot(index='key', columns='B', values='key')

# In this case:
# the row index values are the values from the key column
# the column names are the values from the B column
# the values that appear in the table are also taken from the key column



Let's see what happens when we apply `melt` to our dataframe. 

In [87]:
pd.melt(df,['key'])

# The first input is the dataframe we want to melt
# The second input lists the variables that will be the IDs of the new dataframe;
# all the others will be "melted",
# i.e., all the column names (except the ones in the second input) will be "unpivoted" onto the row axis
# This way, two new columns are created (in place of all the others): "variable" and "value"

In [88]:
df



What's the result of the following expression?
````
pd.melt(df,['key']).pivot(index='key', columns='variable', values='value')
````

In [90]:
## pivot and melt are opposite operations

pd.melt(df,['key']).pivot(index='key', columns='variable', values='value')
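
The round trip can be verified step by step. This sketch uses a copy of the dataframe under the hypothetical name `toy` and keyword arguments for both calls.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["foo", "bar", "baz"],
                    "A": [1, 2, 3],
                    "B": [4, 5, 6],
                    "C": [7, 8, 9]})

long_form = pd.melt(toy, id_vars=["key"])  # columns: key, variable, value
restored = long_form.pivot(index="key", columns="variable", values="value")

print(long_form.shape)  # three melted columns times three rows
print(restored)         # 'key' is now the index rather than a column
```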



`melt` can also be used with multiple identifier columns. Write an expression that keeps two columns as identifiers and "melts" the other two columns simultaneously.


In [92]:

#Answer: Any pair of identifier columns works here, for example: 
pd.melt(df,['key','A'],['B','C'])
pd.melt(df,['key','B'],['A','C'])
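
The same call reads more clearly with keyword arguments. The sketch below works on a hypothetical copy `toy` and also shows two optional arguments of `melt`, `var_name` and `value_name`, which rename the generated columns; a column that is listed in neither `id_vars` nor `value_vars` is left out of the result.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["foo", "bar", "baz"],
                    "A": [1, 2, 3],
                    "B": [4, 5, 6],
                    "C": [7, 8, 9]})

# Keep 'key' and 'A' as identifiers and melt only 'B'; 'C' is not listed anywhere.
partial = pd.melt(toy, id_vars=["key", "A"], value_vars=["B"],
                  var_name="column", value_name="reading")

print(partial)
```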
