In [1]:
import pandas as pd

# **Manipulating data**
In this notebook we'll go over ways to edit the data. In particular we'll see how to:
- rename columns/indexes
- add/remove columns and rows
- combine DataFrames
- parse data: `astype`, `pd.to_datetime`
- remove duplicates
- use `apply`, `map`, `applymap`
- transpose

## Copy vs Inplace
It is important to understand when you are editing the original DataFrame and when you are just getting an edited copy of it, leaving the original untouched.

Some methods allow for both by using the `inplace` parameter. Other times you might want to get a `df.copy()` before you start editing the loaded data.
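
As a minimal sketch of the difference (the names `original`, `renamed` and `working` are only illustrative), the cell below compares a method that returns a copy, an explicit `copy()`, and the same method called with `inplace=True`:

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Cats': [2, 0]})

# without inplace, rename returns a new DataFrame and leaves `original` alone
renamed = original.rename(columns={'Cats': 'Felines'})

# an explicit copy can be edited freely without touching `original`
working = original.copy()
working['Cats'] = 0

# with inplace=True the object itself is modified and the method returns None
result = working.rename(columns={'Cats': 'Felines'}, inplace=True)

print(list(original.columns))  # still the original names
print(list(renamed.columns))
print(result)
```

Because in-place methods return `None`, assigning their result back to `df` is a common mistake worth watching for.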

## Renaming columns/indexes
The simplest way is to overwrite the `columns` attribute with the new names.

In [2]:
df = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Cats':[2,0], 'Dogs':[1,2]})
df

Unnamed: 0,Name,Cats,Dogs
0,Bob,2,1
1,Mary,0,2


In [3]:
df.columns = range(3)
df

Unnamed: 0,0,1,2
0,Bob,2,1
1,Mary,0,2


In [4]:
# same goes for indexes
df.index = range(3,5)
df

Unnamed: 0,0,1,2
3,Bob,2,1
4,Mary,0,2


### Rename method
To rename only a subset of the index/columns you can use the `rename` method, which expects a dictionary with old names as keys and new names as values.

Specify the axis or, even better, use the `columns` and/or `index` parameters to state what you want to rename.

***Note***: the method returns a renamed copy. To edit the original data instead, use the parameter `inplace=True`.

In [5]:
df = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Cats':[2,0], 'Dogs':[1,2]})
df

Unnamed: 0,Name,Cats,Dogs
0,Bob,2,1
1,Mary,0,2


In [6]:
df.rename(columns={'Cats': 'Felines', 'Name': 'Appellative'})

Unnamed: 0,Appellative,Felines,Dogs
0,Bob,2,1
1,Mary,0,2
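
The same method can rename index labels too, and it also accepts a function instead of a dictionary. The cell below is a small sketch of both; the frame `pets` and the new labels are made up for illustration:

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Cats': [2, 0], 'Dogs': [1, 2]})

# rename one column and one index label in the same call
renamed = pets.rename(columns={'Cats': 'Felines'}, index={0: 'first'})

# a callable is applied to every label on the chosen axis
lowercase = pets.rename(columns=str.lower)

print(renamed)
print(lowercase)
```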


## Adding Elements
- Column: `df[colname] = [..data..]` adds a column at the end by assigning the data to it; if the column is already present it is overwritten.
- Row: `pd.concat([df, row])` adds a row at the end; the row should be a one-row DataFrame, for example a `pd.Series` converted with `.to_frame().T`. Older tutorials use `df.append(x)`, which was deprecated in pandas 1.4 and removed in pandas 2.0.

**NOTE**: `pd.concat` returns an edited copy, while the column assignment happens in place.

In [7]:
# row
new_row = pd.Series(['Jack', 1,2,456], index=['Name','Cats','Dogs','new_col'], name='new_row')
df = pd.concat([df, new_row.to_frame().T])  # df.append(new_row) on pandas < 2.0

# column
df['new_col'] = 123   # the number will be broadcast to all rows

df

Unnamed: 0,Name,Cats,Dogs,new_col
0,Bob,2,1,123
1,Mary,0,2,123
new_row,Jack,1,2,123
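
Plain assignment always places a new column at the end and edits the DataFrame in place. As a hedged sketch of two alternatives (the frame `pets` and the `Age` values are invented for the example), `assign` returns a copy with the extra column, while `insert` edits in place at a chosen position:

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Cats': [2, 0]})

# assign returns a new DataFrame; `pets` is left unchanged
with_dogs = pets.assign(Dogs=[1, 2])

# insert edits `pets` in place, placing the column at position 1
pets.insert(1, 'Age', [30, 25])

print(with_dogs)
print(pets)
```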


## Removing Elements
For both columns and rows, this can be done with `df.drop()` by specifying the axis.

Alternatively, the `index` or `columns` parameters can be used, which is more readable.

Use `inplace=True` if you want to edit the object instead of getting a copy.

In [8]:
copy_of_data = df.drop(index='new_row', columns='new_col')  # returns a copy

display(copy_of_data)  # edited copy
display(df)  # still the same

Unnamed: 0,Name,Cats,Dogs
0,Bob,2,1
1,Mary,0,2


Unnamed: 0,Name,Cats,Dogs,new_col
0,Bob,2,1,123
1,Mary,0,2,123
new_row,Jack,1,2,123


In [9]:
df.drop(index='new_row', columns='new_col', inplace=True)  # edits df in place and returns None

df

Unnamed: 0,Name,Cats,Dogs
0,Bob,2,1
1,Mary,0,2


### ***EXERCISE 7.1***
Take the DataFrame provided below and add a column called 'hello' containing a list of 3 greetings of your choice.
Then rename it to 'greetings'.

In [10]:
df = pd.DataFrame({'Name': ['Bob', 'Mary', 'John'], 'Cats':[2,0,3], 'Dogs':[1,2,4]})
# insert solution here

### Combining DataFrames
There are three main methods to combine DataFrames: `concat`, `merge` and `join`.
They have slightly different strengths and you can read more about them in the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).

For this tutorial we'll just focus on `pd.concat`, which can be used to combine multiple DataFrames together, both horizontally and vertically by specifying the axis.

In [11]:
df1 = pd.DataFrame({'col_num': [1,2,3], 'col_letter':['a', 'b', 'c']})
df2 = pd.DataFrame({'col_num': [4,5,6], 'col_letter':['d', 'e', 'f']})

display(df1, df2)

print('combining vertically..')
df_v = pd.concat([df1, df2], axis='index')
display(df_v)

print('combining horizontally..')
df_h = pd.concat([df1, df2], axis='columns')
display(df_h)

Unnamed: 0,col_num,col_letter
0,1,a
1,2,b
2,3,c


Unnamed: 0,col_num,col_letter
0,4,d
1,5,e
2,6,f


combining vertically..


Unnamed: 0,col_num,col_letter
0,1,a
1,2,b
2,3,c
0,4,d
1,5,e
2,6,f


combining horizontally..


Unnamed: 0,col_num,col_letter,col_num,col_letter
0,1,a,4,d
1,2,b,5,e
2,3,c,6,f
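
Note that the vertical result keeps the original row labels, so 0, 1 and 2 each appear twice. If the old labels carry no meaning, one option is the `ignore_index` parameter of `pd.concat`, sketched below with the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'col_num': [1, 2, 3], 'col_letter': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col_num': [4, 5, 6], 'col_letter': ['d', 'e', 'f']})

# default: labels are kept as they are, so the index is no longer unique
kept = pd.concat([df1, df2], axis='index')
print(kept.index.is_unique)

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([df1, df2], axis='index', ignore_index=True)
print(renumbered)
```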


***NOTE***: 

Labels on the shared axis are matched automatically, even if they are out of order. If a label is missing from one of the DataFrames, it is still kept and NaNs fill the values that DataFrame lacks. See the example below.

In [12]:
df3 = pd.DataFrame({'col_colors': ['r', 'b', 'g'], 'col_names':['john', 'bob', 'mary']}, index=[1,0,4])
display(df3)

print('..has index out-of-order, misses index 2 and has extra index 4.\nSo combining it with df1 gives..')
pd.concat([df1, df3], axis='columns')

Unnamed: 0,col_colors,col_names
1,r,john
0,b,bob
4,g,mary


..has index out-of-order, misses index 2 and has extra index 4.
So combining it with df1 gives..


Unnamed: 0,col_num,col_letter,col_colors,col_names
0,1.0,a,b,bob
1,2.0,b,r,john
2,3.0,c,,
4,,,g,mary
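
`pd.concat` aligns on the index. For comparison, `merge` aligns on the values of one or more columns. The cell below is only a minimal sketch; the frames `owners` and `colors` and their contents are invented for illustration:

```python
import pandas as pd

owners = pd.DataFrame({'name': ['bob', 'mary', 'john'], 'cats': [2, 0, 3]})
colors = pd.DataFrame({'name': ['john', 'bob', 'kate'], 'color': ['r', 'b', 'g']})

# how='left' keeps every row of `owners`; names missing from `colors` get NaN
merged = owners.merge(colors, on='name', how='left')
print(merged)
```

Here 'kate' is dropped because she is not in the left frame, and 'mary' gets a NaN colour because she is not in the right one.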


## Parsing
Sometimes the data doesn't come in the right datatype, for example numbers stored as strings.
Simple conversions can be done with the `.astype(..)` method, which attempts to convert all the data to the requested datatype.

In [13]:
df = pd.DataFrame({'numbers': ['5', '7', '4'], 'switches':[1,0,0]})

print(df.dtypes)
df

numbers     object
switches     int64
dtype: object


Unnamed: 0,numbers,switches
0,5,1
1,7,0
2,4,0


In [14]:
df['numbers'] = df['numbers'].astype(int)
df['switches'] = df['switches'].astype(bool)

print(df.dtypes)
df

numbers     int32
switches     bool
dtype: object


Unnamed: 0,numbers,switches
0,5,True
1,7,False
2,4,False
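
`astype` raises an error as soon as one value cannot be converted. When the data may contain such values, one option is `pd.to_numeric` with `errors='coerce'`, which turns them into NaN instead. A small sketch with made-up values:

```python
import pandas as pd

raw = pd.Series(['5', '7', 'n/a', '4.5'])

# unparseable entries become NaN, so the result is a float Series
numbers = pd.to_numeric(raw, errors='coerce')

print(numbers)
print(numbers.isna().sum())  # number of entries that could not be parsed
```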


### Parsing dates
A common issue when dealing with dates is converting them correctly so that you can do operations on them.
pandas provides the function `pd.to_datetime`, with handy parameters to customise the conversion.

In [15]:
df = pd.DataFrame({'string_dates': ['05_04_2016', '15_07_2015']})

print(df.dtypes)
df

string_dates    object
dtype: object


Unnamed: 0,string_dates
0,05_04_2016
1,15_07_2015


In [16]:
df['string_dates'] = pd.to_datetime(df['string_dates'], format='%d_%m_%Y')

print(df.dtypes)
df

string_dates    datetime64[ns]
dtype: object


Unnamed: 0,string_dates
0,2016-04-05
1,2015-07-15
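
Once the column is a real datetime, its parts are available through the `.dt` accessor. The sketch below also uses `errors='coerce'`, which replaces unparseable strings with `NaT` rather than raising; the third value is deliberately invalid for illustration:

```python
import pandas as pd

dates = pd.Series(['05_04_2016', '15_07_2015', 'not a date'])

# strings that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%d_%m_%Y', errors='coerce')

print(parsed)
print(parsed.dt.year)   # year of each valid date
print(parsed.dt.month)  # month of each valid date
```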


### Transposing
Transposing means switching rows and columns, and it can be done simply with `df.T`.

***NOTE***: if a DataFrame has mixed datatypes, transposing will convert everything to `object` type, so you'll need to re-parse it.

In [17]:
df = pd.DataFrame({'A': [1,1,2], 'B':['ciao', 'hello', 'hi']})

print(df.dtypes)
df

A     int64
B    object
dtype: object


Unnamed: 0,A,B
0,1,ciao
1,1,hello
2,2,hi


In [18]:
print(df.T.dtypes)
df.T

0    object
1    object
2    object
dtype: object


Unnamed: 0,0,1,2
A,1,1,2
B,ciao,hello,hi
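
One way to re-parse after transposing, shown here as a sketch rather than the only approach, is `infer_objects`, which tries to find a better dtype for each `object` column. The example transposes twice so that the effect is visible on the original columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['ciao', 'hello', 'hi']})

# transposing twice restores the shape, but every column is now object
round_trip = df.T.T
print(round_trip.dtypes)

# infer_objects converts object columns that hold plain numbers back to numeric
restored = round_trip.infer_objects()
print(restored.dtypes)
```

For columns that need a specific type, an explicit `astype` on each column remains the more predictable choice.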


### Removing duplicates
When removing duplicates it is important to understand what kind of duplication we want to remove:
- duplicate data (values in the rows are the same)
- duplicate indexes/columns (labels are the same)

#### Remove duplicate data
To remove duplicated data (even with different indexes) we can use the handy `df.drop_duplicates()`.

In [19]:
df = pd.DataFrame({'A': [1,1,2], 'B':[2,2,3]})
df

Unnamed: 0,A,B
0,1,2
1,1,2
2,2,3


In [20]:
df.drop_duplicates()

Unnamed: 0,A,B
0,1,2
2,2,3
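
By default `drop_duplicates` compares whole rows and keeps the first occurrence. The `subset` and `keep` parameters change both choices, as in this small sketch (the values are made up so that only column `A` repeats):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 5, 3]})

# no full-row duplicates here, so nothing is dropped
print(df.drop_duplicates())

# compare only column A and keep the first occurrence
first = df.drop_duplicates(subset='A')

# compare only column A and keep the last occurrence
last = df.drop_duplicates(subset='A', keep='last')

# keep=False drops every row that has a duplicate
none_kept = df.drop_duplicates(subset='A', keep=False)

print(first, last, none_kept, sep='\n\n')
```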


#### Remove duplicate cols/rows
To remove rows or columns with the same labels we have to subset the data using the `.duplicated()` method of the index or columns, which returns a mask where True marks a duplicated item.

In [21]:
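# note: a dict literal keeps only the last value of a repeated key, so 'A' is [3,4,5] and only the index has a duplicate
df = pd.DataFrame({'A': [1,2,3], 'A': [3,4,5], 'B':[6,7,8]}, index=['C','D','C'])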
df

Unnamed: 0,A,B
C,3,6
D,4,7
C,5,8


In [22]:
df.index.duplicated()

array([False, False,  True])

In [23]:
df = df.loc[~df.index.duplicated()]   # rows
df = df.loc[:, ~df.columns.duplicated()]   # columns
df

Unnamed: 0,A,B
C,3,6
D,4,7
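
A Python dict cannot hold the same key twice, so a DataFrame built from a dict never has duplicated column names. To see the column mask at work, the sketch below builds the frame from a list of rows with an explicit `columns` list instead (the values are illustrative):

```python
import pandas as pd

# passing the labels separately allows a repeated column name
df = pd.DataFrame([[1, 3, 6], [2, 4, 7]], columns=['A', 'A', 'B'])

mask = df.columns.duplicated()
print(mask)  # True marks the second 'A'

# keep only the first occurrence of each column label
deduped = df.loc[:, ~mask]
print(deduped)
```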


### ***EXERCISE 7.2***
Parse the dates and remove the duplicate timestamps from the data below.

In [24]:
import numpy as np  # pd.np is no longer available in recent pandas versions
df = pd.DataFrame({i: np.random.randn(6) for i in range(3)}, index=[f'D0{i}M10Y2016' for i in (3,4,3,4,5,3)])
# insert solution here

## Apply custom function to the data
To use a function on the data inside a DataFrame we can use the `apply` method.
It applies the function to each Series along the given axis: each column (axis=0) or each row (axis=1).

If you need to apply a custom function element-wise to the whole DataFrame, you can use `applymap` (renamed to `DataFrame.map` in pandas 2.1). This is useful if the function is not vectorised, i.e. it accepts a single value rather than an iterable.

In [25]:
df = pd.DataFrame({'A': [3,4,5], 'B':[6,7,8]}, index=['C','D','E'])
df

Unnamed: 0,A,B
C,3,6
D,4,7
E,5,8


In [26]:
import numpy as np  # pd.np is no longer available in recent pandas versions
df['A'].apply(np.square)  # apply to a single column
df.apply(np.square)  # apply to a whole dataframe (only works if function is vectorised)

Unnamed: 0,A,B
C,9,36
D,16,49
E,25,64


In [27]:
def square_if_even(x):  # non-vectorised function
    return x**2 if x%2==0 else x
    
df['A'].apply(square_if_even)  # apply to a single column
df.applymap(square_if_even)  # apply to the whole dataframe element-wise

Unnamed: 0,A,B
C,3,36
D,16,7
E,5,64
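
For a single column, `Series.map` is a close relative of `apply`: it works element-wise and accepts either a function or a dictionary of replacements. A minimal sketch, with the `team` lookup invented for the example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Mary', 'John'], 'Cats': [2, 0, 3]})

# map with a function: called once per element
df['has_cats'] = df['Cats'].map(lambda n: n > 0)

# map with a dictionary: values without a matching key become NaN
df['team'] = df['Name'].map({'Bob': 'red', 'Mary': 'blue'})

print(df)
```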


### Apply a function that needs multiple columns as input
If you want to apply a function to each row that uses data from multiple columns, you can use the `axis=1` flag and map the parameters using a lambda expression.

In [28]:
import numpy as np  # pd.np is no longer available in recent pandas versions
df = pd.DataFrame({i: np.random.randn(6) for i in ['a','b','c']})

def my_function(a, c):
    return a+c
    
df['my_result'] = df.apply(lambda row: my_function(a=row['a'], c=row['c']), axis=1)
df

Unnamed: 0,a,b,c,my_result
0,0.102182,-0.086612,-0.012026,0.090156
1,-0.277075,1.435563,-0.381969,-0.659044
2,1.926825,-1.018924,0.287467,2.214293
3,0.141198,0.701604,1.101626,1.242824
4,-0.776151,-0.450205,-0.493446,-1.269597
5,-0.166641,0.41696,-2.137159,-2.3038


If your function is simple enough you can also write it to take a row as input. This avoids the need for a lambda function.

*Only do this if the function is simple*: more complex functions are easier to test when they request only the inputs they need, in a more structured form.

In [29]:
def my_simple_function(row):
    return row['a']+row['c']
    
df['my_simple_result'] = df.apply(my_simple_function, axis=1)
df

Unnamed: 0,a,b,c,my_result,my_simple_result
0,0.102182,-0.086612,-0.012026,0.090156,0.090156
1,-0.277075,1.435563,-0.381969,-0.659044,-0.659044
2,1.926825,-1.018924,0.287467,2.214293,2.214293
3,0.141198,0.701604,1.101626,1.242824,1.242824
4,-0.776151,-0.450205,-0.493446,-1.269597,-1.269597
5,-0.166641,0.41696,-2.137159,-2.3038,-2.3038
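
Row-wise `apply` calls the Python function once per row, which can be slow on large data. When the operation can be written with column arithmetic, the vectorised form gives the same result. The sketch below checks this on random data; the seed and the variable names are arbitrary choices for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame({name: rng.standard_normal(6) for name in ['a', 'b', 'c']})

# row-wise: the lambda is called once for every row
row_wise = df.apply(lambda row: row['a'] + row['c'], axis=1)

# vectorised: the whole columns are added in one operation
vectorised = df['a'] + df['c']

print(np.allclose(row_wise, vectorised))
```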


### ***EXERCISE 7.3***
Create and apply a function over the rows that returns the squared number of cats if the number of dogs is even, 0 otherwise.

In [30]:
df = pd.DataFrame({'Name': ['Bob', 'Mary', 'John'], 'Cats':[2,0,3], 'Dogs':[1,2,4]})
# insert solution here