# Big Data Real-Time Analytics with Python and Spark

## Chapter 3 - Data Manipulation in Python with Pandas
- Documentation: https://pandas.pydata.org/

-- **Part 1**
- Create a dataframe from a dictionary (organizing the columns and selecting)
- Print only the name of columns or only the index name

-- **Part 2**
- Create a nested dictionary (a dictionary whose values are themselves dictionaries)
- Another way to create a dataframe from a dictionary (with pd.DataFrame.from_dict)
- Several ways to get a column or row (slicing and methods)
 
-- **Part 3**
- Slicing recall exercise
- The loc and iloc indexers
- Some operations

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# Import the pandas module
import pandas as pd

In [3]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

pandas: 1.5.0



## Operations with dataframes
### Part 1

In [4]:
# Create a dictionary of key-value pairs (ticker -> Series of metrics)
stock = {'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']),
        'GOOG': pd.Series([1133.43,36.05,355.83,0.87,31.44,380.64],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']),
        'FB': pd.Series([61.48,0.59,2450,104.93,150.92],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)']),
        'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Beta', 'Market Cap(B)']),
        'TWTR': pd.Series([65.25,-0.3,555.2,36.23],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'Market Cap(B)']),
        'APPL': pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
                           index = ['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)', 'Beta'])}

In [5]:
# Type of the object
type(stock)

dict

In [6]:
# View
stock

{'AMZN': Closing price            346.15
 EPS                        0.59
 Shares Outstanding(M)    459.00
 Beta                       0.52
 P/E                      589.80
 Market Cap(B)            158.88
 dtype: float64,
 'GOOG': Closing price            1133.43
 EPS                        36.05
 Shares Outstanding(M)     355.83
 Beta                        0.87
 P/E                        31.44
 Market Cap(B)             380.64
 dtype: float64,
 'FB': Closing price              61.48
 EPS                         0.59
 Shares Outstanding(M)    2450.00
 P/E                       104.93
 Market Cap(B)             150.92
 dtype: float64,
 'YHOO': Closing price              34.90
 EPS                         1.27
 Shares Outstanding(M)    1010.00
 P/E                        27.48
 Beta                        0.66
 Market Cap(B)              35.36
 dtype: float64,
 'TWTR': Closing price             65.25
 EPS                       -0.30
 Shares Outstanding(M)    555.20
 Market Cap(B)     

In [7]:
# Convert the dictionary into a dataframe (pandas decides the row order)
stock_df = pd.DataFrame(stock)

In [8]:
stock_df.head(6)

| | AMZN | GOOG | FB | YHOO | TWTR | APPL |
|---|---|---|---|---|---|---|
| Beta | 0.52 | 0.87 | NaN | 0.66 | NaN | 0.84 |
| Closing price | 346.15 | 1133.43 | 61.48 | 34.9 | 65.25 | 501.53 |
| EPS | 0.59 | 36.05 | 0.59 | 1.27 | -0.3 | 40.32 |
| Market Cap(B) | 158.88 | 380.64 | 150.92 | 35.36 | 36.23 | 447.59 |
| P/E | 589.8 | 31.44 | 104.93 | 27.48 | NaN | 12.44 |
| Shares Outstanding(M) | 459.0 | 355.83 | 2450.0 | 1010.0 | 555.2 | 892.45 |


In [20]:
# We can also pass index when converting the dictionary, which sets the order of the rows
stock_df = pd.DataFrame(stock,
                       index = ['Closing price',
                                'EPS',
                                'Shares Outstanding(M)',
                                'P/E',
                                'Market Cap(B)',
                                'Beta'])

In [22]:
# View the data
stock_df.head(6)

| | AMZN | GOOG | FB | YHOO | TWTR | APPL |
|---|---|---|---|---|---|---|
| Closing price | 346.15 | 1133.43 | 61.48 | 34.9 | 65.25 | 501.53 |
| EPS | 0.59 | 36.05 | 0.59 | 1.27 | -0.3 | 40.32 |
| Shares Outstanding(M) | 459.0 | 355.83 | 2450.0 | 1010.0 | 555.2 | 892.45 |
| P/E | 589.8 | 31.44 | 104.93 | 27.48 | NaN | 12.44 |
| Market Cap(B) | 158.88 | 380.64 | 150.92 | 35.36 | 36.23 | 447.59 |
| Beta | 0.52 | 0.87 | NaN | 0.66 | NaN | 0.84 |


**Note:** A manipulation like this one can introduce NaN values: here FB and TWTR have no Beta and TWTR has no P/E, so those cells are NaN. It is good practice to check for NaN values after each manipulation.
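
A minimal sketch of such a check, assuming a reduced version of the data above (only two tickers and two metrics are kept so the example stays short). `isna()` marks the missing cells and `sum()` counts them per column:

```python
import pandas as pd

# Reduced data from the stock dictionary: TWTR has no 'Beta' value
amzn = pd.Series([346.15, 0.52], index=['Closing price', 'Beta'])
twtr = pd.Series([65.25], index=['Closing price'])

small_df = pd.DataFrame({'AMZN': amzn, 'TWTR': twtr})

# Number of NaN values in each column
nan_per_column = small_df.isna().sum()
print(nan_per_column)

# True if there is at least one NaN anywhere in the dataframe
has_nan = bool(small_df.isna().any().any())
print(has_nan)
```

Running the same two checks on `stock_df` after each transformation is one way to notice NaN values as soon as they appear.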

In [23]:
# We can define only the columns that we want in the dataframe
stock_df = pd.DataFrame(stock,
                       index = ['Closing price',
                                'EPS',
                                'Shares Outstanding(M)',
                                'P/E',
                                'Market Cap(B)',
                                'Beta'],
                       columns = ['AMZN','GOOG','FB'])

In [24]:
# View the data
stock_df.head(6)

| | AMZN | GOOG | FB |
|---|---|---|---|
| Closing price | 346.15 | 1133.43 | 61.48 |
| EPS | 0.59 | 36.05 | 0.59 |
| Shares Outstanding(M) | 459.0 | 355.83 | 2450.0 |
| P/E | 589.8 | 31.44 | 104.93 |
| Market Cap(B) | 158.88 | 380.64 | 150.92 |
| Beta | 0.52 | 0.87 | NaN |


In [25]:
# View the index
stock_df.index

Index(['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)',
       'Beta'],
      dtype='object')

In [26]:
# View the column names
stock_df.columns

Index(['AMZN', 'GOOG', 'FB'], dtype='object')

### Part 2

In [27]:
# Create a nested dictionary (a dictionary whose values are themselves dictionaries)
olympic_medals_table = {
    'EUA':{'Gold':46, 'Silver':37, 'Bronze': 38},
    'China':{'Gold':26, 'Silver':18, 'Bronze': 26},
    'England':{'Gold':27, 'Silver':23, 'Bronze': 17},
    'Germany':{'Gold':19, 'Silver':18, 'Bronze': 19},
    'Uruguay':{'Gold':17, 'Silver':10, 'Bronze': 15},
}

In [29]:
# Another way to convert the dictionary to a pandas dataframe (with the function from_dict)
df_olympic = pd.DataFrame.from_dict(olympic_medals_table)

In [30]:
type(df_olympic)

pandas.core.frame.DataFrame

In [31]:
df_olympic.head()

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |
| Bronze | 38 | 26 | 17 | 19 | 15 |
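
By default `from_dict` turns the outer keys into columns, as in the output above. A short sketch of the `orient` parameter, assuming only two countries from the table to keep it small; with `orient='index'` the outer keys become rows instead:

```python
import pandas as pd

# Two entries taken from the medals dictionary
medals = {
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
}

# Default (orient='columns'): countries become columns
by_column = pd.DataFrame.from_dict(medals)

# orient='index': countries become rows, medal types become columns
by_row = pd.DataFrame.from_dict(medals, orient='index')

print(by_column)
print(by_row)
```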


In [32]:
df_olympic['China']

Gold      26
Silver    18
Bronze    26
Name: China, dtype: int64

In [33]:
China_Medals = df_olympic['China']

In [35]:
# We can see that the dataframe is a set of Series
# So we can manipulate each column in pandas with Series methods and attributes
# Then we save the result again in the dataframe
type(China_Medals)

pandas.core.series.Series

In [34]:
China_Medals

Gold      26
Silver    18
Bronze    26
Name: China, dtype: int64
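
As a sketch of that idea, we can apply Series methods to one column and store the result back in the dataframe. The column name `China_share` is made up for this illustration and is not part of the original notebook:

```python
import pandas as pd

df = pd.DataFrame({
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
    'England': {'Gold': 27, 'Silver': 23, 'Bronze': 17},
})

# A column is a Series, so Series methods and arithmetic are available
china = df['China']
total = china.sum()      # total number of medals for China
share = china / total    # fraction of each medal type

# Save the result back in the dataframe as a new (hypothetical) column
df['China_share'] = share
print(df)
```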

In [36]:
# Another way to select a column (this does not work when the column name contains spaces)
df_olympic.Uruguay

Gold      17
Silver    10
Bronze    15
Name: Uruguay, dtype: int64
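
A small sketch of the limitation with spaces. The layout (countries as rows) and the column name `Total medals` are assumptions made for this example; the totals are the sums of the medal counts shown above:

```python
import pandas as pd

df = pd.DataFrame(
    {'Gold': [46, 26], 'Total medals': [121, 70]},
    index=['EUA', 'China'],
)

# Attribute access works for a name that is a valid Python identifier
gold = df.Gold

# A name with a space cannot be written as an attribute
# (df.Total medals is a syntax error), so we use brackets
total = df['Total medals']

print(gold)
print(total)
```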

In [37]:
# To select more than one column, pass a list of names inside the []
df_olympic[['Germany', 'EUA']]

| | Germany | EUA |
|---|---|---|
| Gold | 19 | 46 |
| Silver | 18 | 37 |
| Bronze | 19 | 38 |


In [38]:
# Another way to get a single column is the get method
df_olympic.get('England')

Gold      27
Silver    23
Bronze    17
Name: England, dtype: int64
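
One difference worth noting between `get` and the brackets: for a column that does not exist, `get` returns a default value (`None` unless we pass another one) instead of raising `KeyError`. A minimal sketch, using a column name that is not in the table:

```python
import pandas as pd

df = pd.DataFrame({
    'England': {'Gold': 27, 'Silver': 23, 'Bronze': 17},
})

# Existing column: same result as df['England']
england = df.get('England')

# Missing column: get returns None by default...
missing = df.get('Brazil')

# ...or the default we pass as second argument
fallback = df.get('Brazil', 'not found')

# Brackets raise KeyError for the same missing column
try:
    df['Brazil']
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```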

In [40]:
df_olympic

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |
| Bronze | 38 | 26 | 17 | 19 | 15 |


### Part 3 (Recall exercise)

In [41]:
df_olympic

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |
| Bronze | 38 | 26 | 17 | 19 | 15 |


In [42]:
df_olympic[:2]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |


In [43]:
df_olympic[2:]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Bronze | 38 | 26 | 17 | 19 | 15 |


In [44]:
df_olympic[::2]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Bronze | 38 | 26 | 17 | 19 | 15 |


In [45]:
df_olympic[::-2]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Bronze | 38 | 26 | 17 | 19 | 15 |
| Gold | 46 | 26 | 27 | 19 | 17 |


In [46]:
df_olympic

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |
| Bronze | 38 | 26 | 17 | 19 | 15 |


In [47]:
# loc selects by label. Here we select the 'Gold' row of the dataframe
# The result is displayed vertically like a column, but it is the row (a Series indexed by the column names)
df_olympic.loc['Gold']

EUA        46
China      26
England    27
Germany    19
Uruguay    17
Name: Gold, dtype: int64

In [48]:
# All rows, only the EUA column
df_olympic.loc[:,'EUA']

Gold      46
Silver    37
Bronze    38
Name: EUA, dtype: int64

In [49]:
# Two ways to return the number of silver medals for China
df_olympic.loc['Silver', 'China']

18

In [50]:
df_olympic.loc['Silver']['China']

18
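
For reading a value, the two forms above give the same result. For assignment, the single call with both labels is generally the safer form: the chained form first builds an intermediate Series, and depending on the pandas version the assignment may not reach the original dataframe. A minimal sketch of the single-call form, where the new value 20 is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
})

# Work on a copy so the original table stays unchanged
updated = df.copy()

# One loc call with the row label and the column label
updated.loc['Silver', 'China'] = 20  # hypothetical value

print(updated.loc['Silver', 'China'])
print(df.loc['Silver', 'China'])
```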

In [51]:
df_olympic.loc['Silver']

EUA        37
China      18
England    23
Germany    18
Uruguay    10
Name: Silver, dtype: int64

In [52]:
# The comparison returns a boolean Series
df_olympic.loc['Gold'] > 20

EUA         True
China       True
England     True
Germany    False
Uruguay    False
Name: Gold, dtype: bool

In [53]:
# Return only the values where the condition is True. The boolean Series built with loc can itself be used as the column selector inside loc
df_olympic.loc['Gold',df_olympic.loc['Gold'] > 20] 

EUA        46
China      26
England    27
Name: Gold, dtype: int64
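
The same boolean Series can also keep whole columns instead of a single row. A sketch, assuming the same medals table and the same threshold of 20 gold medals:

```python
import pandas as pd

df = pd.DataFrame({
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
    'England': {'Gold': 27, 'Silver': 23, 'Bronze': 17},
    'Germany': {'Gold': 19, 'Silver': 18, 'Bronze': 19},
    'Uruguay': {'Gold': 17, 'Silver': 10, 'Bronze': 15},
})

# Boolean Series indexed by the column names
mask = df.loc['Gold'] > 20

# ':' keeps every row, the mask keeps only the columns where it is True
more_than_20_gold = df.loc[:, mask]
print(more_than_20_gold)
```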

**Note:** With **loc** we select rows and columns by label (the index and column names), while with **iloc** we select them by integer position (starting at 0).
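
One practical consequence, sketched below on a two-column version of the table: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as in regular Python slicing.

```python
import pandas as pd

df = pd.DataFrame({
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
})

# Label slice: 'Silver' is included
by_label = df.loc['Gold':'Silver']

# Position slice: position 1 is excluded, so only 'Gold' is returned
by_position = df.iloc[0:1]

print(by_label)
print(by_position)
```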

In [55]:
# iloc selects the row at integer position 0
df_olympic.iloc[0]

EUA        46
China      26
England    27
Germany    19
Uruguay    17
Name: Gold, dtype: int64

In [56]:
df_olympic.iloc[2]

EUA        38
China      26
England    17
Germany    19
Uruguay    15
Name: Bronze, dtype: int64

In [57]:
df_olympic

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |
| Bronze | 38 | 26 | 17 | 19 | 15 |


In [58]:
df_olympic.iloc[:2]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 46 | 26 | 27 | 19 | 17 |
| Silver | 37 | 18 | 23 | 18 | 10 |


In [59]:
df_olympic.iloc[2,0:2]

EUA      38
China    26
Name: Bronze, dtype: int64

In [60]:
# This does not raise an error: the end of the slice (3) is exclusive, so only position 2 is selected
df_olympic.iloc[2:3, :]

| | EUA | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Bronze | 38 | 26 | 17 | 19 | 15 |
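
To make the difference concrete, here is a small sketch on a two-column version of the table: a slice that runs past the last row is simply cut at the end, while asking for a single position that does not exist raises `IndexError`.

```python
import pandas as pd

df = pd.DataFrame({
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
})

# The slice goes beyond the 3 existing rows: no error, only 'Bronze' is returned
rows = df.iloc[2:10]
print(rows)

# A single out-of-range position is an error
try:
    df.iloc[3]
    raised = False
except IndexError:
    raised = True

print(raised)
```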


In [61]:
df_olympic.iloc[1,:]

EUA        37
China      18
England    23
Germany    18
Uruguay    10
Name: Silver, dtype: int64

In [62]:
df_olympic.iloc[2, 0]

38

> We can delete part of the dataframe

In [63]:
# Delete a column
del df_olympic['EUA']

In [64]:
df_olympic

| | China | England | Germany | Uruguay |
|---|---|---|---|---|
| Gold | 26 | 27 | 19 | 17 |
| Silver | 18 | 23 | 18 | 10 |
| Bronze | 26 | 17 | 19 | 15 |
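
`del` changes the dataframe in place. When we want to keep the original, one alternative is `drop`, which returns a new dataframe without the column. A short sketch on a two-column version of the table:

```python
import pandas as pd

df = pd.DataFrame({
    'EUA': {'Gold': 46, 'Silver': 37, 'Bronze': 38},
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
})

# drop returns a new dataframe; df itself is not modified
without_eua = df.drop(columns='EUA')

print(without_eua.columns.tolist())
print(df.columns.tolist())
```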


In [65]:
# To insert a new column
df_olympic.insert(0, 'Brazil', (17, 16, 23))

In [66]:
df_olympic

| | Brazil | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| Gold | 17 | 26 | 27 | 19 | 17 |
| Silver | 16 | 18 | 23 | 18 | 10 |
| Bronze | 23 | 26 | 17 | 19 | 15 |
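
For comparison, a sketch of the two ways to add the same column: plain assignment puts the new column at the end, while `insert` lets us choose the position (0 is the first column).

```python
import pandas as pd

base = pd.DataFrame({
    'China': {'Gold': 26, 'Silver': 18, 'Bronze': 26},
    'England': {'Gold': 27, 'Silver': 23, 'Bronze': 17},
})

# Assignment: the new column is added at the end
appended = base.copy()
appended['Brazil'] = [17, 16, 23]

# insert: the new column is placed at the given position
inserted = base.copy()
inserted.insert(0, 'Brazil', [17, 16, 23])

print(appended.columns.tolist())
print(inserted.columns.tolist())
```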


In [67]:
df_olympic.describe()

| | Brazil | China | England | Germany | Uruguay |
|---|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 18.666667 | 23.333333 | 22.333333 | 18.666667 | 14.0 |
| std | 3.785939 | 4.618802 | 5.033223 | 0.57735 | 3.605551 |
| min | 16.0 | 18.0 | 17.0 | 18.0 | 10.0 |
| 25% | 16.5 | 22.0 | 20.0 | 18.5 | 12.5 |
| 50% | 17.0 | 26.0 | 23.0 | 19.0 | 15.0 |
| 75% | 20.0 | 26.0 | 25.0 | 19.0 | 16.0 |
| max | 23.0 | 26.0 | 27.0 | 19.0 | 17.0 |
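
To see where these numbers come from, we can recompute a few of them for a single column. A sketch using the Brazil values inserted above; note that `std` here is the sample standard deviation (it divides by n - 1):

```python
import pandas as pd

brazil = pd.Series([17, 16, 23], index=['Gold', 'Silver', 'Bronze'], name='Brazil')

# describe on a single Series gives the same statistics as one column of the table
summary = brazil.describe()

# The same values computed one by one
mean = brazil.mean()
std = brazil.std()               # sample standard deviation (ddof=1 by default)
median = brazil.median()         # the 50% row
first_quartile = brazil.quantile(0.25)

print(summary)
print(mean, std, median, first_quartile)
```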


# The End