# Big Data Real-Time Analytics with Python and Spark

## Chapter 3 - Data Manipulation in Python with Pandas
- Documentation: https://pandas.pydata.org/
- Create a new column without broadcasting, at the end and at a specific position
- Create a new column with broadcasting, at the end and at a specific position
- Delete a column
- Rename columns
- Set an ID starting from 1 as the index before delivering the dataset to the decision makers

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# Import the pandas module
import pandas as pd

In [3]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

pandas: 1.5.0



## Adding Columns and an Index, With and Without Broadcasting

### Without Broadcasting

In [5]:
# Load a file from disk and store it as a dataframe
df = pd.read_csv('datasets/dataset1.csv')

In [6]:
df.head()

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0
1,ATL,Ambrose,Mikey,D,65625.0,65625.0
2,ATL,Asad,Yamil,M,150000.0,150000.0
3,ATL,Bloom,Mark,D,99225.0,106573.89
4,ATL,Carleton,Andrew,F,65000.0,77400.0


In [7]:
# Add a new column
# With bracket notation: if the column name does not exist yet, assigning a value creates it
df['final_salary'] = 0

In [8]:
df.head()

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0,0
1,ATL,Ambrose,Mikey,D,65625.0,65625.0,0
2,ATL,Asad,Yamil,M,150000.0,150000.0,0
3,ATL,Bloom,Mark,D,99225.0,106573.89,0
4,ATL,Carleton,Andrew,F,65000.0,77400.0,0


In [9]:
# Fill the column with values computed from other columns
df['final_salary'] = df['base_salary'] + df['guaranteed_compensation']

**Note:** We could have computed the sum directly when creating the column, but it can be useful to create the column first with an arbitrary placeholder value and fill it afterwards. That way the result is easier to check: after the fill, any placeholder (0 here) that is still present suggests the operation did not run as expected, and any NaN shows that an input value was missing for that row. **This is only a different strategy to be able to better observe the data.**
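
The strategy described in the note can be sketched on a small toy dataframe. The frame `toy` and its values are hypothetical and only stand in for the salary dataset; the counts at the end are one possible way to inspect the filled column.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the salary dataset (illustrative values only)
toy = pd.DataFrame({
    "base_salary": [100.0, 200.0, np.nan],
    "guaranteed_compensation": [110.0, 250.0, np.nan],
})

# Step 1: create the column with a placeholder value
toy["final_salary"] = 0

# Step 2: fill it from the other columns (the whole column is overwritten)
toy["final_salary"] = toy["base_salary"] + toy["guaranteed_compensation"]

# Step 3: count leftover placeholders and missing values
leftover_zeros = int((toy["final_salary"] == 0).sum())
missing_values = int(toy["final_salary"].isna().sum())
print(leftover_zeros, missing_values)
```

Because the assignment replaces the whole column, a row with missing inputs ends up as NaN rather than keeping the placeholder, so checking `isna()` is usually the more informative of the two counts.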

In [10]:
df.head()

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0,4209500.0
1,ATL,Ambrose,Mikey,D,65625.0,65625.0,131250.0
2,ATL,Asad,Yamil,M,150000.0,150000.0,300000.0
3,ATL,Bloom,Mark,D,99225.0,106573.89,205798.89
4,ATL,Carleton,Andrew,F,65000.0,77400.0,142400.0


In [11]:
# Another way to add a column: the insert method lets us choose the position
df.insert(0, column = "ID",  value = range(1, 1 + len(df)))

In [12]:
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
0,1,ATL,Almiron,Miguel,M,1912500.0,2297000.0,4209500.0
1,2,ATL,Ambrose,Mikey,D,65625.0,65625.0,131250.0
2,3,ATL,Asad,Yamil,M,150000.0,150000.0,300000.0
3,4,ATL,Bloom,Mark,D,99225.0,106573.89,205798.89
4,5,ATL,Carleton,Andrew,F,65000.0,77400.0,142400.0
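
As a small illustration of how the position argument of `insert` behaves, the sketch below places one column first and another last. The frame `toy` and the `currency` column are made up for this example.

```python
import pandas as pd

# Toy frame; the names and values are illustrative only
toy = pd.DataFrame({"club": ["ATL", "VAN"], "base_salary": [100.0, 200.0]})

# Position 0 places the new column first
toy.insert(0, column="ID", value=range(1, 1 + len(toy)))

# A position equal to the current number of columns appends at the end
toy.insert(len(toy.columns), column="currency", value="USD")

print(list(toy.columns))
```

Note that `insert` modifies the dataframe in place and returns `None`, so its result should not be assigned back to `df`.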


In [13]:
df.tail()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
610,611,VAN,Teibert,Russell,M,126500.0,194000.0,320500.0
611,612,VAN,Tornaghi,Paolo,GK,80000.0,80000.0,160000.0
612,613,VAN,Waston,Kendall,D,350000.0,368125.0,718125.0
613,614,,,,,,,
614,615,VAN,Williams,Sheanon,D,175000.0,184000.0,359000.0


### With Broadcasting

> **Broadcasting** is the propagation of an operation, such as adding or multiplying by a single value, to every element of a column or dataframe
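
To make the definition concrete, the sketch below applies a single value to a whole Series in three equivalent ways. The Series `salaries` is a hypothetical stand-in for the `base_salary` column.

```python
import pandas as pd

# Toy Series standing in for the base_salary column
salaries = pd.Series([100.0, 200.0, 300.0], name="base_salary")

# The scalar 5 is applied to every element; no explicit loop is needed
with_operator = salaries + 5
with_method = salaries.add(5)

# The same result written as an explicit loop, for comparison
with_loop = pd.Series([value + 5 for value in salaries], name="base_salary")

print(with_operator.equals(with_method), with_method.equals(with_loop))
```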

In [14]:
# This does not change the original dataframe; add() returns a new Series stored in another area of memory
df['base_salary'].add(5)

0      1912505.0
1        65630.0
2       150005.0
3        99230.0
4        65005.0
         ...    
610     126505.0
611      80005.0
612     350005.0
613          NaN
614     175005.0
Name: base_salary, Length: 615, dtype: float64

In [15]:
# You can see that the original was not changed
df.base_salary.head()

0    1912500.0
1      65625.0
2     150000.0
3      99225.0
4      65000.0
Name: base_salary, dtype: float64
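
The behaviour shown above, where `add` leaves the original untouched, can be checked on a toy Series. The sketch also shows the optional `fill_value` parameter, which is not used in this notebook; the Series `s` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value
s = pd.Series([100.0, np.nan], name="base_salary")

# add() returns a new Series; s itself is not modified
result = s.add(5)

# With fill_value, missing entries are replaced before the operation
filled = s.add(5, fill_value=0)

print(s.iloc[0], result.iloc[0], filled.iloc[1])
```

Whether a missing salary should be treated as 0 is a modelling decision; leaving it as NaN, as the notebook does, keeps the gap visible.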

In [16]:
# To modify the original, assign the result back to the column
df['base_salary'] = df['base_salary'].add(5)

In [17]:
df.base_salary.head()

0    1912505.0
1      65630.0
2     150005.0
3      99230.0
4      65005.0
Name: base_salary, dtype: float64

In [19]:
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
0,1,ATL,Almiron,Miguel,M,1912505.0,2297000.0,4209500.0
1,2,ATL,Ambrose,Mikey,D,65630.0,65625.0,131250.0
2,3,ATL,Asad,Yamil,M,150005.0,150000.0,300000.0
3,4,ATL,Bloom,Mark,D,99230.0,106573.89,205798.89
4,5,ATL,Carleton,Andrew,F,65005.0,77400.0,142400.0


In [20]:
# Add a new column using broadcasting (convert USD to EUR)
# Exchange rate of the day = 0.92
df['base_salary_eur'] = df['base_salary'].mul(0.92)

In [21]:
# The column we created was added at the end
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary,base_salary_eur
0,1,ATL,Almiron,Miguel,M,1912505.0,2297000.0,4209500.0,1759504.6
1,2,ATL,Ambrose,Mikey,D,65630.0,65625.0,131250.0,60379.6
2,3,ATL,Asad,Yamil,M,150005.0,150000.0,300000.0,138004.6
3,4,ATL,Bloom,Mark,D,99230.0,106573.89,205798.89,91291.6
4,5,ATL,Carleton,Andrew,F,65005.0,77400.0,142400.0,59804.6
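
A variation on the conversion above, sketched on toy data: the rate is kept in a named constant and the result is rounded to two decimals. The name `USD_TO_EUR` and the rounding step are assumptions added for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Assumed exchange rate, kept in one place so it is easy to update
USD_TO_EUR = 0.92

# Toy frame; values are illustrative only
toy = pd.DataFrame({"club": ["ATL", "VAN"], "base_salary": [100.0, 250.5]})

# Broadcasting multiplies every row by the rate; round() trims float noise
toy["base_salary_eur"] = toy["base_salary"].mul(USD_TO_EUR).round(2)

print(toy)
```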


In [22]:
# Delete a column
# inplace = True modifies the original dataframe instead of returning a new one
df.drop(columns = ['base_salary_eur'], inplace = True)

In [23]:
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,guaranteed_compensation,final_salary
0,1,ATL,Almiron,Miguel,M,1912505.0,2297000.0,4209500.0
1,2,ATL,Ambrose,Mikey,D,65630.0,65625.0,131250.0
2,3,ATL,Asad,Yamil,M,150005.0,150000.0,300000.0
3,4,ATL,Bloom,Mark,D,99230.0,106573.89,205798.89
4,5,ATL,Carleton,Andrew,F,65005.0,77400.0,142400.0
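
The difference between dropping with and without `inplace` can be sketched as follows. The frame `toy` and its column names are hypothetical; `errors="ignore"` is an optional parameter that the notebook does not use.

```python
import pandas as pd

# Toy frame with three columns
toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Default behaviour: a new dataframe is returned and toy is unchanged
without_b = toy.drop(columns=["b"])

# inplace=True modifies toy itself and returns None
toy.drop(columns=["c"], inplace=True)

# errors="ignore" skips column names that do not exist instead of raising
safe = toy.drop(columns=["missing"], errors="ignore")

print(list(without_b.columns), list(toy.columns), list(safe.columns))
```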


In [25]:
# Create the column at a specific position with insert (next to the USD base salary)
# The broadcast result is passed as the value; the rate needs a decimal point (0.92), not a comma
df.insert(6, column = 'base_salary_eur', value = df['base_salary'].mul(0.92))

In [26]:
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary,base_salary_eur,guaranteed_compensation,final_salary
0,1,ATL,Almiron,Miguel,M,1912505.0,1759504.6,2297000.0,4209500.0
1,2,ATL,Ambrose,Mikey,D,65630.0,60379.6,65625.0,131250.0
2,3,ATL,Asad,Yamil,M,150005.0,138004.6,150000.0,300000.0
3,4,ATL,Bloom,Mark,D,99230.0,91291.6,106573.89,205798.89
4,5,ATL,Carleton,Andrew,F,65005.0,59804.6,77400.0,142400.0


In [27]:
# Rename selected columns with a dictionary (old name -> new name)
df.rename(columns = {"base_salary": "base_salary_usd",
         "guaranteed_compensation": "guaranteed_compensation_usd",
         "final_salary": "final_salary_usd"},
         inplace = True)

In [28]:
df.head()

Unnamed: 0,ID,club,last_name,first_name,position,base_salary_usd,base_salary_eur,guaranteed_compensation_usd,final_salary_usd
0,1,ATL,Almiron,Miguel,M,1912505.0,1759504.6,2297000.0,4209500.0
1,2,ATL,Ambrose,Mikey,D,65630.0,60379.6,65625.0,131250.0
2,3,ATL,Asad,Yamil,M,150005.0,138004.6,150000.0,300000.0
3,4,ATL,Bloom,Mark,D,99230.0,91291.6,106573.89,205798.89
4,5,ATL,Carleton,Andrew,F,65005.0,59804.6,77400.0,142400.0


In [29]:
# Use the ID column as the index (most people expect the first item to be number 1)
# It is useful to change the index before delivering the dataset for use in another tool
# Positional access inside Python still starts at 0, so only do this when the data leaves the Python environment
# Decision makers using other programs, such as Power BI, will see the ID column starting at 1
df.set_index('ID', inplace = True)

In [30]:
df.head()

ID,club,last_name,first_name,position,base_salary_usd,base_salary_eur,guaranteed_compensation_usd,final_salary_usd
1,ATL,Almiron,Miguel,M,1912505.0,1759504.6,2297000.0,4209500.0
2,ATL,Ambrose,Mikey,D,65630.0,60379.6,65625.0,131250.0
3,ATL,Asad,Yamil,M,150005.0,138004.6,150000.0,300000.0
4,ATL,Bloom,Mark,D,99230.0,91291.6,106573.89,205798.89
5,ATL,Carleton,Andrew,F,65005.0,59804.6,77400.0,142400.0


In [31]:
df.tail()

ID,club,last_name,first_name,position,base_salary_usd,base_salary_eur,guaranteed_compensation_usd,final_salary_usd
611,VAN,Teibert,Russell,M,126505.0,116384.6,194000.0,320500.0
612,VAN,Tornaghi,Paolo,GK,80005.0,73604.6,80000.0,160000.0
613,VAN,Waston,Kendall,D,350005.0,322004.6,368125.0,718125.0
614,,,,,,,,
615,VAN,Williams,Sheanon,D,175005.0,161004.6,184000.0,359000.0
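
The renaming and indexing steps above can be sketched end to end on a toy frame, including what another tool would receive on export. Writing to an in-memory buffer with `to_csv` is an assumption made here to keep the example self-contained; the notebook itself does not show an export step.

```python
import io
import pandas as pd

# Toy frame; names and values are illustrative only
toy = pd.DataFrame({
    "ID": [1, 2],
    "club": ["ATL", "VAN"],
    "base_salary": [100.0, 200.0],
})

# rename returns a new dataframe unless inplace=True is passed
toy = toy.rename(columns={"base_salary": "base_salary_usd"})

# Use the 1-based ID column as the index
toy = toy.set_index("ID")

# Label-based access uses the ID; positional access still starts at 0
first_by_label = toy.loc[1, "club"]
first_by_position = toy.iloc[0]["club"]

# Export to an in-memory buffer; the index becomes the first CSV column
buffer = io.StringIO()
toy.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])
```

The exported header starts with `ID`, which is what a tool such as Power BI would read as the first column.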


# The End