Import pandas and NumPy

In [None]:
import pandas as pd
import numpy as np

Next, check and set the working directory.


In [None]:
import os
# show current working directory
os.getcwd()
# list files in the directory
os.listdir()
# change working directory
os.chdir("/")

#2. Loading data
Now load your datasets from wherever they are stored (desktop, cloud, SQL server). It’s a good idea to make a copy of the original dataset and work with the copy, because you’ll be making a lot of modifications and the original should stay intact.

In [None]:
# import from a csv file
data = pd.read_csv("/content/Iris.csv")

# copying a dataset
df = data.copy()
# call the head function
df.head(6)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
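
Why bother with copy()? Here is a minimal sketch, using a hypothetical three-row frame called raw instead of the real file. It shows that a plain assignment only creates a second name for the same dataframe, while copy() gives you an independent one.

```python
import pandas as pd

# Hypothetical three-row stand-in for the loaded dataset
raw = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 4.7]})

alias = raw            # same object under a second name
working = raw.copy()   # independent copy

working.loc[0, "SepalLengthCm"] = 0.0   # does not touch raw
alias.loc[1, "SepalLengthCm"] = -1.0    # changes raw as well

first_value = raw.loc[0, "SepalLengthCm"]    # unchanged by the edit to working
second_value = raw.loc[1, "SepalLengthCm"]   # overwritten through alias
```

So edits made on the copy stay on the copy, which is what you want while cleaning data.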


#3. Initial data screening


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [None]:
# number of rows and columns
df.shape
# column names
df.columns


Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [None]:
# number of unique values
df["Species"].nunique()
# name of the unique values
df["Species"].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [None]:
# count of categorical data
df["Species"].value_counts()

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: Species, dtype: int64

#4. Missing value treatment
Missing values are no surprise in real-world data; pandas marks them as NaN. You can count the missing values in each column of a dataset by typing the following:


In [None]:
# show NaN values per feature
df.isnull().sum()


Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

You could also obtain missing values as a percentage of total observations (it is quite useful for large datasets).


In [None]:
# NaN values as % of total observations
df.isnull().sum()*100/len(df)

Id               0.0
SepalLengthCm    0.0
SepalWidthCm     0.0
PetalLengthCm    0.0
PetalWidthCm     0.0
Species          0.0
dtype: float64
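
If you want both numbers side by side, one possible sketch is to put the count and the percentage into a single summary table. The frame toy and the column names n_missing and pct_missing are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with some gaps
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, np.nan],
    "Species": ["Iris-setosa", "Iris-setosa", None, "Iris-virginica"],
})

is_missing = toy.isnull()
missing = pd.DataFrame({
    "n_missing": is_missing.sum(),           # count of NaN per column
    "pct_missing": is_missing.mean() * 100,  # mean of booleans = share of NaN
})
```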

The copy of the iris dataset loaded here has no missing values, as the counts above show, but most real datasets do. Once you find them, what do you do with them? You can do one of the following:

a) drop the rows or columns containing null values;


In [None]:
## Drop row/column ##
#####################
# drop all rows containing at least one null (returns a new dataframe)
df.dropna()
# drop all columns containing at least one null
df.dropna(axis=1)
# keep only the columns that have at least 5 non-NaN values
df.dropna(axis=1, thresh=5)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
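
Since the iris file has nothing to drop, here is a small sketch on a hypothetical frame (toy, with made-up columns a, b and c) that shows what each call actually keeps. Note that thresh counts non-NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "a" has 1 valid value, "b" has 3, "c" has 4
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

complete_rows = toy.dropna()                 # only rows without any NaN
complete_cols = toy.dropna(axis=1)           # only columns without any NaN
mostly_full = toy.dropna(axis=1, thresh=3)   # columns with >= 3 non-NaN values
```

None of these calls changes toy itself; each returns a new dataframe.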


b) or replace/impute missing cells with other values;


In [None]:
## Replace values ##
####################
# replace all NaN values with -9999
df.fillna(-9999)
# additional tip: you can also overwrite any specific cell (this modifies df in place)
df.at[1, "SepalLengthCm"] = 9999
# fill NaN values with NaN (a no-op, shown only for the syntax)
df.fillna(np.nan)
# fill NaN values with a string
df.fillna("data missing")
# fill missing values with the column means (numeric columns only)
df.fillna(df.mean(numeric_only=True))
# replace NaN values of a specific column with its mean
df["SepalLengthCm"].fillna(df["SepalLengthCm"].mean())
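
To see mean imputation do something visible, here is a sketch on a hypothetical three-row frame (toy) with one gap in each column. It assumes you only want to fill numeric columns, so the means are computed with numeric_only=True.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing label
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, np.nan, 7.0],
    "Species": ["Iris-setosa", None, "Iris-virginica"],
})

col_means = toy.mean(numeric_only=True)   # Series indexed by column name
filled = toy.fillna(col_means)            # each column gets its own mean

# fillna returns a new dataframe, so toy still has its gap
gaps_left_in_toy = toy["SepalLengthCm"].isnull().sum()
```

The missing Species label stays missing, because there is no mean for a text column.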

c) or, if it’s time-series data, interpolation is a great way to impute data.


In [None]:
## Interpolate ##
#################
# interpolation of missing values (useful in time-series)
df.interpolate() # all dataframe
df["SepalLengthCm"].interpolate() # specific column
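
Here is a rough sketch of interpolation on an actual time series. The dates and values are invented, and the observations are deliberately unevenly spaced. The default method treats rows as equally spaced, while method="time" weights by the gap between timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the middle one is missing, and the dates are uneven
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
series = pd.Series([1.0, np.nan, 4.0], index=idx)

by_position = series.interpolate()               # linear, ignores the dates
by_time = series.interpolate(method="time")      # linear in elapsed time

mid_by_position = by_position.iloc[1]   # halfway between 1.0 and 4.0
mid_by_time = by_time.iloc[1]           # one third of the way (1 of 3 days)
```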

#5. Subsetting & working with columns
Not all columns in a dataset are of interest; sometimes we select specific columns for analysis or for building a model. Subsetting allows you to do that.
There are two key ways to select columns: by column name and by column position:


In [None]:
# select a column by column name
df["SepalLengthCm"]
# select multiple columns by column name
df[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "Species"]]
# select a range of columns by position (end position is excluded)
df.iloc[:, 2:4]
# select multiple columns by position
df.iloc[:, [1,3,4]]
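
One detail that trips people up: label slices include the end point, position slices do not. A small sketch on a hypothetical two-row frame (toy):

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the iris file
toy = pd.DataFrame({
    "Id": [1, 2],
    "SepalLengthCm": [5.1, 4.9],
    "SepalWidthCm": [3.5, 3.0],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

by_label = toy.loc[:, "SepalLengthCm":"SepalWidthCm"]   # end label included
by_position = toy.iloc[:, 1:3]                          # end position excluded
numeric_only = toy.select_dtypes(include="number")      # select by dtype instead
```

Both slices return the same two sepal columns here.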

But what if you want to subset data by dropping a column?


In [None]:
# drop a column 
df.drop("SepalLengthCm", axis=1)


Now let’s say you want to create a new column calculated from an existing one, for example:
sepal_len_mm = SepalLengthCm * 10
Creating new calculated columns is often a big part of feature engineering.

In [None]:
# add new calculated column (centimetres to millimetres)
df["sepal_len_mm"] = df["SepalLengthCm"] * 10
# create a conditional calculated column
df["newcol"] = ["short" if i < 3 else "long" for i in df["SepalWidthCm"]]
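
A conditional column can also be built without a Python loop. This is a sketch using np.where on a hypothetical three-row frame; the column names SepalWidthMm and width_class are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical widths in centimetres
toy = pd.DataFrame({"SepalWidthCm": [2.5, 3.0, 3.5]})

# calculated column: arithmetic applies to the whole column at once
toy["SepalWidthMm"] = toy["SepalWidthCm"] * 10

# conditional column: "short" where the condition holds, "long" elsewhere
toy["width_class"] = np.where(toy["SepalWidthCm"] < 3, "short", "long")
```

On large dataframes the vectorized form is usually faster than a list comprehension, though both give the same labels.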


Sometimes re-coding may be needed to convert categorical string values to numeric values.


In [None]:
# keys must match the stored labels exactly; returns a new dataframe
df.replace({"Species":{"Iris-setosa":1, "Iris-versicolor":2, "Iris-virginica":3}})
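
Another way to re-code, sketched here on a hypothetical frame, is Series.map with a dictionary. The column name SpeciesCode is an assumption. Unlike replace, map turns any label missing from the dictionary into NaN, which makes typos in the keys easy to spot.

```python
import pandas as pd

codes = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# Hypothetical labels, including one that is not in the dictionary
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "unknown"],
})

toy["SpeciesCode"] = toy["Species"].map(codes)   # unmapped labels become NaN
n_unmapped = toy["SpeciesCode"].isnull().sum()
```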


If aggregation of column values is needed (mean, median, etc.), pandas and NumPy have built-in functions that can be applied to the dataframe.


In [None]:
# calculate mean of each of two columns
df[["SepalLengthCm", "SepalWidthCm"]].mean()
# calculate sum and mean of each column (function names passed as strings)
df[["SepalLengthCm", "SepalWidthCm"]].agg(["sum", "mean"])
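
To make the shape of the result concrete, here is a sketch on a hypothetical three-row frame. A list of function names gives one row per function, while a dictionary lets each column have its own function.

```python
import pandas as pd

# Hypothetical measurements with easy-to-check totals
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, 6.0, 7.0],
    "SepalWidthCm": [3.0, 2.0, 4.0],
})

summary = toy.agg(["sum", "mean"])   # rows: sum, mean; columns: as in toy
per_column = toy.agg({"SepalLengthCm": "mean", "SepalWidthCm": "max"})
```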

And finally, some bonus syntax useful to work with columns:


In [None]:
# transposing a dataset
df.T
# create a list of columns
df.columns.tolist()
# sorting values in ascending order
df.sort_values(by = "SepalWidthCm", ascending = True)
# change column name
df.rename(columns={"old_name": "new_name"})

#6. Filtering: working with rows

Filtering is an important part of exploratory data analysis, drawing insights and building KPIs.
There are many ways to filter data depending on the analytics needs, such as:
a) using the row index location:

In [None]:
# select rows at positions 3 to 9 (the end position 10 is excluded)
df.iloc[3:10,]
# select rows by index label (pass a list of labels)
df.loc[[3, 10]]
# finding rows with specific strings
df[df["Species"].isin(["Iris-setosa"])]
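
The difference between position and label is easier to see when the index is not 0, 1, 2 and so on. A sketch with a hypothetical frame indexed by letters:

```python
import pandas as pd

# Hypothetical frame with string labels as the index
toy = pd.DataFrame(
    {"SepalLengthCm": [5.1, 4.9, 4.7, 4.6]},
    index=["a", "b", "c", "d"],
)

by_position = toy.iloc[1:3]      # positions 1 and 2; position 3 is excluded
by_label = toy.loc["b":"c"]      # labels b through c; end label is included
picked = toy.loc[["a", "d"]]     # an explicit list of labels
```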

b) conditional filtering


In [None]:
# simple conditional filtering to filter rows with SepalLengthCm >= 5
df.query('SepalLengthCm >= 5') # or
df[df.SepalLengthCm >= 5]
# filtering rows with multiple values e.g. 0.2, 0.3
df[df["PetalWidthCm"].isin([0.2, 0.3])]
# multi-conditional filtering (& binds tighter than |, made explicit with parentheses)
df[((df.SepalLengthCm > 1) & (df.Species == "Iris-setosa")) | (df.SepalWidthCm < 3)]
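
When you mix & and | in one filter, the grouping matters. This sketch on a hypothetical three-row frame shows that leaving the parentheses out evaluates & first, and that grouping the | first gives a different set of rows.

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.0, 6.0],
    "SepalWidthCm": [3.5, 2.5, 2.8],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

long_sepal = toy["SepalLengthCm"] > 5
is_setosa = toy["Species"] == "Iris-setosa"
narrow = toy["SepalWidthCm"] < 3

implicit = long_sepal & is_setosa | narrow       # & is evaluated before |
and_first = (long_sepal & is_setosa) | narrow    # same thing, spelled out
or_first = long_sepal & (is_setosa | narrow)     # a different filter

rows_and_first = toy[and_first]
rows_or_first = toy[or_first]
```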

And finally, here’s how you’d get rid of a row if needed.


In [None]:
# drop rows (returns a new dataframe)
df.drop(df.index[1]) # drops the row at position 1


#7. Grouping

Like filtering, grouping is another important part of exploratory data analysis and data visualization. The key function for this task is groupby(), which is mainly used for aggregating rows based on categorical features.


In [None]:
# returns a DataFrameGroupBy object grouped by the "Species" column
df.groupby("Species")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb5a317f0d0>

After the dataframe is grouped, you could apply different functions to it, for example, getting aggregate values of numeric columns:


In [None]:
# return the mean of a column for each "Species" category
df.groupby("Species")["SepalLengthCm"].mean()

Or you can apply several aggregate functions to multiple features:


In [None]:
# group numeric columns by "Species", then apply multiple operations to each feature
df.groupby("Species")[["SepalLengthCm", "SepalWidthCm"]].agg(["sum", "mean", "std"])
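
If you would rather name the output columns yourself, pandas supports named aggregation in agg(). A sketch on a hypothetical three-row frame; the output names mean_length, max_width and n_rows are made up.

```python
import pandas as pd

# Hypothetical frame with two species
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "SepalLengthCm": [5.0, 5.2, 6.5],
    "SepalWidthCm": [3.4, 3.6, 3.0],
})

# each keyword is output_name=(source_column, function)
summary = toy.groupby("Species").agg(
    mean_length=("SepalLengthCm", "mean"),
    max_width=("SepalWidthCm", "max"),
    n_rows=("SepalLengthCm", "size"),
)
```

The result has one row per species and flat column names, which is often easier to plot or export than the two-level columns a list of functions produces.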

#8. Joining/merging

If you know SQL I don’t have to explain how important joining is. pandas has functions such as merge(), join(), and concat() for SQL-style joining. If SQL is your primary database you probably won’t have to do much joining in Python, but nevertheless you should add the following code to your cheatsheet.

In [None]:
# combine two dataframes side by side (rows are aligned on the index)
df1 = df[["SepalLengthCm", "SepalWidthCm"]]
df2 = df[["SepalLengthCm", "PetalLengthCm"]]
dfx = pd.concat([df1, df2], axis = 1)
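
For a join on a key column, closer to what SQL does, merge() is the usual tool. A sketch with two hypothetical tables (sepals and petals) that share an Id column but only partly overlap:

```python
import pandas as pd

# Hypothetical tables: Ids 2 and 3 appear in both
sepals = pd.DataFrame({"Id": [1, 2, 3], "SepalLengthCm": [5.1, 4.9, 4.7]})
petals = pd.DataFrame({"Id": [2, 3, 4], "PetalLengthCm": [1.4, 1.3, 1.5]})

# INNER JOIN: only Ids present in both tables
inner = pd.merge(sepals, petals, on="Id", how="inner")

# LEFT JOIN: every row of sepals, NaN where petals has no match
left = pd.merge(sepals, petals, on="Id", how="left")
```

The how argument also accepts "right" and "outer", mirroring the other SQL join types.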