# Big Data Real-Time Analytics with Python and Spark

## Chapter 3 - Data Manipulation in Python with Pandas
- Documentation: https://pandas.pydata.org/
- How to import pandas
- How to import a file to a dataframe
- How to see type, shape, size
- head(), tail() methods
- info(), describe() methods
- Best way to use value_counts()
- How to extract a column from the dataframe
- How to use methods and attributes on a specific column
- How to create a dataframe from a Python data structure

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# To update a package: !pip install -U package_name
# To install a specific package version: !pip install package_name==version
# pip is the Python package installer; the leading ! runs the command in the operating system shell
# Use -q for a quiet installation and -U to upgrade the package if it is already installed

# After installing or updating a package, restart the Jupyter notebook kernel

# Install watermark package
# This package is used to record the versions of other packages used in this jupyter notebook
!pip install -q -U watermark

In [3]:
# Import the pandas module
import pandas as pd

In [4]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

pandas: 1.5.0
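
If watermark is not available, the version of an imported package can usually be read from its `__version__` attribute. A minimal sketch, assuming pandas is already installed (the variable name `pandas_version` is only illustrative):

```python
import pandas as pd

# pandas exposes its version as a plain string
pandas_version = pd.__version__
print('pandas:', pandas_version)
```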



In [5]:
!pip install -q -U pandas

In [6]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

pandas: 1.5.0



## Pandas - Dataframes, Methods and Attributes

In [7]:
# Load a file from disk and store it as a dataframe
df = pd.read_csv('datasets/dataset1.csv')
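
The file `datasets/dataset1.csv` is not included here, so the sketch below is only an approximation of the same step. It assumes a tiny in-memory CSV built from three rows that appear in the `head()` output later in this notebook; the names `csv_text` and `sample_df` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for datasets/dataset1.csv, kept in memory instead of on disk
csv_text = """club,last_name,first_name,position,base_salary,guaranteed_compensation
ATL,Almiron,Miguel,M,1912500.0,2297000.0
ATL,Ambrose,Mikey,D,65625.0,65625.0
ATL,Gressel,Julian,,75000.0,93750.0
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)                      # (rows, columns)
print(sample_df['position'].isna().sum())   # empty fields are read as NaN
```

Reading from a string like this is a convenient way to try `read_csv` options without touching the real file.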

In [8]:
# type of the object
type(df)

pandas.core.frame.DataFrame

In [9]:
# Shape attribute
df.shape

(615, 6)

In [10]:
# Size attribute (total number of cells: 615 rows x 6 columns)
df.size

3690
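
The relationship between `shape` and `size` can be checked on a small frame. This is a sketch on illustrative data (three rows taken from the `head()` output in this notebook), not on the full dataset:

```python
import pandas as pd

# Small illustrative frame; sample_df is a hypothetical name
sample_df = pd.DataFrame({
    'club': ['ATL', 'ATL', 'ATL'],
    'last_name': ['Almiron', 'Ambrose', 'Asad'],
    'base_salary': [1912500.0, 65625.0, 150000.0],
})

n_rows, n_cols = sample_df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(sample_df.size)              # total number of cells = rows * columns
print(len(sample_df))              # len() returns the number of rows only
```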

In [11]:
# head() method
df.head(10)

,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0
1,ATL,Ambrose,Mikey,D,65625.0,65625.0
2,ATL,Asad,Yamil,M,150000.0,150000.0
3,ATL,Bloom,Mark,D,99225.0,106573.89
4,ATL,Carleton,Andrew,F,65000.0,77400.0
5,ATL,Carmona,Carlos,M,675000.0,725000.0
6,ATL,Garza,Greg,D,150000.0,150000.0
7,ATL,Gonzalez Pirez,Leandro,D,250008.0,285008.0
8,ATL,Goslin,Chris,M,70000.0,74000.0
9,ATL,Gressel,Julian,,75000.0,93750.0


In [12]:
# tail() method
df.tail(10)

,club,last_name,first_name,position,base_salary,guaranteed_compensation
605,VAN,Rosales,Mauro,M,65004.0,65004.0
606,VAN,Seiler,Cole,D,54075.0,54075.0
607,VAN,Shea,Brek,M-D,625000.0,670000.0
608,VAN,Tchani,Tony,M,275000.0,308333.33
609,VAN,Techera,Cristian,M,352000.0,377000.0
610,VAN,Teibert,Russell,M,126500.0,194000.0
611,VAN,Tornaghi,Paolo,GK,80000.0,80000.0
612,VAN,Waston,Kendall,D,350000.0,368125.0
613,,,,,,
614,VAN,Williams,Sheanon,D,175000.0,184000.0
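
Both methods take the number of rows as an optional argument. A minimal sketch on a hypothetical 20-row frame (`row_id` is an invented column name):

```python
import pandas as pd

# Hypothetical frame with 20 rows, numbered 0 to 19
sample_df = pd.DataFrame({'row_id': range(20)})

first_rows = sample_df.head()    # with no argument, head() returns 5 rows
last_rows = sample_df.tail(3)    # the last 3 rows

print(len(first_rows))               # 5
print(last_rows['row_id'].tolist())  # [17, 18, 19]
```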


In [13]:
# info() method
# This single command gives a good summary of the dataframe
# The Non-Null Count column shows how many non-missing values each column has, so we can infer the number of NaN values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   club                     614 non-null    object 
 1   last_name                614 non-null    object 
 2   first_name               610 non-null    object 
 3   position                 604 non-null    object 
 4   base_salary              614 non-null    float64
 5   guaranteed_compensation  614 non-null    float64
dtypes: float64(2), object(4)
memory usage: 29.0+ KB
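
The Non-Null Count only gives the number of missing values indirectly (for example, `position` has 615 - 604 = 11 missing values). A possible way to count them directly is `isna().sum()`, sketched here on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
sample_df = pd.DataFrame({
    'club': ['ATL', 'VAN', np.nan],
    'position': ['M', np.nan, np.nan],
    'base_salary': [1912500.0, 80000.0, np.nan],
})

missing_per_column = sample_df.isna().sum()     # NaN count per column
non_null_per_column = sample_df.notna().sum()   # matches Non-Null Count in info()

print(missing_per_column)
print(non_null_per_column)
```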


In [14]:
# describe() method
# Describes the data with a statistical summary of the numerical variables
# Returns: count, mean, standard deviation, min value, percentiles (50% is the median), max value
df.describe()

,base_salary,guaranteed_compensation
count,614.0,614.0
mean,297173.0,326375.2
std,672583.9,749121.7
min,52999.92,52999.92
25%,65633.4,70030.35
50%,125000.0,135002.0
75%,255000.0,279875.0
max,6660000.0,7167500.0
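
To illustrate that the 50% row is the median, and that other percentiles can be requested, here is a sketch on four salaries taken from the outputs above. The frame `sample_df` and the chosen percentiles are assumptions made for the example:

```python
import pandas as pd

# Four base salaries taken from the head() and tail() outputs in this notebook
sample_df = pd.DataFrame({
    'base_salary': [1912500.0, 65625.0, 150000.0, 80000.0],
})

summary = sample_df.describe()
median_from_describe = summary.loc['50%', 'base_salary']
median_direct = sample_df['base_salary'].median()
print(median_from_describe, median_direct)   # both are 115000.0

# The percentiles argument selects which percentiles are reported
custom = sample_df['base_salary'].describe(percentiles=[0.1, 0.5, 0.9])
print(custom)
```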


In [15]:
# describe() method on one column (for a text variable we get count, unique, top and freq)
df.club.describe()

count     614
unique     23
top       VAN
freq       32
Name: club, dtype: object

In [16]:
# value_counts() method (calling it on the whole dataframe is not the best way to use it, because it counts every unique row and returns too much information)
df.value_counts()

club  last_name  first_name  position  base_salary  guaranteed_compensation
ATL   Almiron    Miguel      M         1912500.0    2297000.0                  1
ORL   Perez      Matias      F         260004.0     260004.0                   1
      Redding    Tommy       D         110000.0     117500.0                   1
      Rivas      Carlos      M         375000.0     375000.0                   1
      Rocha      Tony        M         65620.8      65620.8                    1
                                                                              ..
KC    Iwasa      Cameron     F         53004.0      53004.0                    1
      Juliao     Igor        D         100008.0     115008.0                   1
      Medranda   Jimmy       D         130008.0     130008.0                   1
      Melia      Tim         GK        165000.0     167500.0                   1
VAN   de Jong    Marcel      D-M       140000.0     140000.0                   1
Length: 600, dtype: int64

In [17]:
# Better ways to use value_counts() (narrow down what to count before calling value_counts)
# Use with data types
df.dtypes.value_counts()

object     4
float64    2
dtype: int64

In [18]:
# Use with a column (Count the values for each category of the variable club)
df.club.value_counts()

VAN      32
PHI      31
ATL      31
CLB      30
ORL      30
DAL      29
SJ       29
NYCFC    28
HOU      28
NYRB     28
TOR      27
RSL      27
POR      27
MTL      27
CHI      27
MNUFC    27
LA       27
DC       27
KC       26
COL      26
SEA      25
NE       23
LAFC      2
Name: club, dtype: int64
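
Two optional parameters of `value_counts()` are often useful on a single column: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts missing values. A short sketch on a hypothetical `positions` series:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
positions = pd.Series(['M', 'D', 'M', 'F', np.nan, 'M'], name='position')

counts = positions.value_counts()                     # NaN is excluded by default
shares = positions.value_counts(normalize=True)       # relative frequencies
with_missing = positions.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(shares)
print(with_missing)
```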

We can use the attributes and methods shown above on a specific column, but first we need to know how to extract the column.

In [19]:
# 1. Extract a column (attribute access)
df.club

0      ATL
1      ATL
2      ATL
3      ATL
4      ATL
      ... 
610    VAN
611    VAN
612    VAN
613    NaN
614    VAN
Name: club, Length: 615, dtype: object

In [20]:
# 2. Extract a column (bracket indexing)
df['club']

0      ATL
1      ATL
2      ATL
3      ATL
4      ATL
      ... 
610    VAN
611    VAN
612    VAN
613    NaN
614    VAN
Name: club, Length: 615, dtype: object

In [21]:
# View the first 10 rows of the extracted column
df['club'].head(10)

0    ATL
1    ATL
2    ATL
3    ATL
4    ATL
5    ATL
6    ATL
7    ATL
8    ATL
9    ATL
Name: club, dtype: object

In [22]:
# View the data type of the values in the extracted column ('O' stands for object)
df['club'].dtypes

dtype('O')

In [23]:
# Type of the object returned when we extract a column
type(df['club'])

pandas.core.series.Series

When we select a column this way (df['club']), pandas returns a one-dimensional labeled array, which it calls a Series. A dataframe is a set of Series that share the same index, and we can manipulate each column as an independent data structure. So df['club'] is a Series, while the data type of its values is object (the dtype pandas 1.5 uses for strings by default).
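
The sketch below illustrates the point on a small hypothetical frame: single brackets give a Series, a list of names gives a dataframe, and Series methods can be called directly on the extracted column.

```python
import pandas as pd

# Hypothetical frame with values taken from the outputs above
sample_df = pd.DataFrame({
    'club': ['ATL', 'ATL', 'VAN'],
    'base_salary': [1912500.0, 65625.0, 80000.0],
})

col_as_series = sample_df['club']     # single brackets return a Series
col_as_frame = sample_df[['club']]    # a list of names returns a one-column DataFrame

print(type(col_as_series).__name__)
print(type(col_as_frame).__name__)

# Methods and attributes apply to the extracted column
print(sample_df['club'].nunique())        # number of distinct clubs
print(sample_df['base_salary'].max())     # highest base salary
```

Attribute access (`sample_df.club`) returns the same Series, but bracket indexing also works for column names that contain spaces or clash with existing method names.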

In [29]:
# We can use the pd.DataFrame() constructor to create pandas dataframes from a Python data structure
# We create the data structure, then convert it to a pandas dataframe so that we can use the pandas methods and attributes
data = [['a',1,1.0], ['b',2,2.0], ['c',3,3.0]]
df_test = pd.DataFrame(data)

In [30]:
df_test.head()

,0,1,2
0,a,1,1.0
1,b,2,2.0
2,c,3,3.0
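
Without column names, pandas labels the columns 0, 1, 2. A possible refinement is to pass names through the `columns` argument, or to build the dataframe from a dictionary. The column names below are hypothetical and chosen only for illustration:

```python
import pandas as pd

data = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]]

# Same list of lists as above, now with (hypothetical) column names
df_named = pd.DataFrame(data, columns=['letter', 'integer', 'decimal'])

# A dictionary maps each column name to its list of values
df_from_dict = pd.DataFrame({
    'letter': ['a', 'b', 'c'],
    'integer': [1, 2, 3],
    'decimal': [1.0, 2.0, 3.0],
})

print(df_named.columns.tolist())
print(df_from_dict.shape)
```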


# The End