## Introduction

This is a notebook I use to learn more about Pandas. It follows a basic path: creating a dataframe, describing the data, and then further steps. Some of the examples come from my own work with data, including wrangling I had to do before analyzing it and building models.

In [1]:
# Step 0. Load libraries
import pandas as pd
import numpy as np

### 1.1 Creating a dataframe

Pandas has many methods for creating a new dataframe object. One easy method is to create an empty dataframe and then define each column separately.

In [2]:
# Create an empty data frame object
df = pd.DataFrame()

In [3]:
# Then create each column, and add elements using lists
df['id'] = [1,2,3,4,5,6]
df['call_id'] = [200,200,200,300,300,300]
df['result'] = ['answering machine', 'call back', 'call back',
                'still workable', 'transfer call', 'do not call']
df['code_result'] = ['am','cb','cb','sw','tc','dc']

In [4]:
# Let's print the dataframe
df

Unnamed: 0,id,call_id,result,code_result
0,1,200,answering machine,am
1,2,200,call back,cb
2,3,200,call back,cb
3,4,300,still workable,sw
4,5,300,transfer call,tc
5,6,300,do not call,dc
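
As an alternative to filling an empty dataframe column by column, the same table can be built in one step by passing a dictionary to the constructor. This is only a sketch of that option; the variable name `calls` is made up here so it does not overwrite `df`.

```python
import pandas as pd

# Build the same call-log frame in one step from a dictionary:
# keys become column names, lists become the column values.
# 'calls' is a hypothetical name chosen for this sketch.
calls = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'call_id': [200, 200, 200, 300, 300, 300],
    'result': ['answering machine', 'call back', 'call back',
               'still workable', 'transfer call', 'do not call'],
    'code_result': ['am', 'cb', 'cb', 'sw', 'tc', 'dc'],
})

# Six rows and four columns, as in the column-by-column version.
print(calls.shape)
```

All lists must have the same length, otherwise the constructor raises an error.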


### 1.2 Describing the data

We often want to view summary characteristics of a dataframe, such as the mean, standard deviation and count of elements.

In [5]:
# Let's load a dataset, from UCI
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'
# This dataset comes without header
df = pd.read_csv(url, sep=',', header=None,
                 names=['varWav', 'skeWav', 'curtWav', 'entropy', 'class'])

This is the UCI Banknote Authentication dataset. The data were extracted from images taken to evaluate an authentication procedure for banknotes, and a Wavelet Transform tool was used to extract features from the images [1]. The features (and their types) are:

1. variance of Wavelet Transformed image (continuous) 
2. skewness of Wavelet Transformed image (continuous) 
3. kurtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous) 
5. class (integer) 

In [6]:
# Let's look at the dataset: its head, tail and dimensions
df

Unnamed: 0,varWav,skeWav,curtWav,entropy,class
0,3.62160,8.66610,-2.8073,-0.44699,0
1,4.54590,8.16740,-2.4586,-1.46210,0
2,3.86600,-2.63830,1.9242,0.10645,0
3,3.45660,9.52280,-4.0112,-3.59440,0
4,0.32924,-4.45520,4.5718,-0.98880,0
...,...,...,...,...,...
1367,0.40614,1.34920,-1.4501,-0.55949,1
1368,-1.38870,-4.87730,6.4774,0.34179,1
1369,-3.75030,-13.45860,17.5932,-2.77710,1
1370,-3.56370,-8.38270,12.3930,-1.28230,1


In [7]:
# Now let's see the statistics
df.describe()

Unnamed: 0,varWav,skeWav,curtWav,entropy,class
count,1372.0,1372.0,1372.0,1372.0,1372.0
mean,0.433735,1.922353,1.397627,-1.191657,0.444606
std,2.842763,5.869047,4.31003,2.101013,0.497103
min,-7.0421,-13.7731,-5.2861,-8.5482,0.0
25%,-1.773,-1.7082,-1.574975,-2.41345,0.0
50%,0.49618,2.31965,0.61663,-0.58665,0.0
75%,2.821475,6.814625,3.17925,0.39481,1.0
max,6.8248,12.9516,17.9274,2.4495,1.0


We need to be cautious here: pandas treats the column 'class' as numeric because it contains only 0s and 1s, but in this case the values represent categories, so its summary statistics should not be read like those of the continuous features.
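
One possible way to handle a label column such as 'class' is to convert it to the `category` dtype, so that `describe()` leaves it out of the numeric summary. The sketch below uses a tiny made-up frame (`notes`, with invented values) instead of the downloaded data, so it runs without a network connection.

```python
import pandas as pd

# Toy stand-in for the banknote data; the values are made up for illustration.
notes = pd.DataFrame({
    'varWav': [3.6, 4.5, -1.4, -3.7],
    'class': [0, 0, 1, 1],
})

# Mark the label column as categorical so it is not summarised as a number.
notes['class'] = notes['class'].astype('category')

# By default describe() now reports only the numeric column.
numeric_summary = notes.describe()

# For a categorical column, counting rows per category is a more natural summary.
class_counts = notes['class'].value_counts()

print(numeric_summary.columns.tolist())
print(class_counts)
```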

### 1.3 Navigating dataframes

Sometimes we need to select individual data or slices of a dataframe. 

In [8]:
# Let's select the first row
df.iloc[0]

varWav     3.62160
skeWav     8.66610
curtWav   -2.80730
entropy   -0.44699
class      0.00000
Name: 0, dtype: float64

In [9]:
# Now let's select three rows
df.iloc[0:3]

Unnamed: 0,varWav,skeWav,curtWav,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0


In [10]:
# We can index our dataset, for example by class (the 'class' column itself is also kept)
df = df.set_index(df['class'])
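
The cells above select rows by integer position with `iloc`. Once a dataframe has a meaningful index, the label-based counterpart is `loc`. The sketch below contrasts the two on a small made-up frame (`toy`, with invented labels and values), not on the banknote data.

```python
import pandas as pd

# Toy frame with a non-default index; labels and values are made up.
toy = pd.DataFrame(
    {'varWav': [3.6, 4.5, -1.4], 'entropy': [-0.4, -1.5, 0.3]},
    index=['a', 'b', 'c'],
)

first_by_position = toy.iloc[0]          # first row, by integer position
first_by_label = toy.loc['a']            # the same row, by index label
two_rows = toy.iloc[0:2]                 # positions 0 and 1 (end excluded)
two_rows_label = toy.loc['a':'b']        # labels 'a' through 'b' (end included)
single_value = toy.loc['b', 'entropy']   # one cell: row label, column name

print(two_rows)
print(single_value)
```

Note the difference in slicing: `iloc` excludes the end position, while `loc` includes the end label.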

### 1.4 Selecting rows based on conditionals

In [11]:
# Let's read the Titanic Dataset
url = 'https://raw.githubusercontent.com/chrisalbon/kaggle/master/titanic/data/train.csv'
df_tc = pd.read_csv(url)
df_tc

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [12]:
# Let's filter by sex "female", and show only the first two rows
df_tc[df_tc['Sex']=='female'].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


We have created a condition: the column 'Sex' must be equal to 'female'. After that, we call the method head and specify that we want only two rows.

In [13]:
# Now let's filter by sex 'female' and passengers aged 60 or over
df_tc[(df_tc['Sex']=='female') & (df_tc['Age']>=60)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Note that when combining two or more conditions, we need to enclose each one in parentheses, because & binds more tightly than comparison operators such as == and >=.
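
A way to keep combined filters readable is to store each condition as a named boolean mask first and then combine the masks with `&` (and), `|` (or) and `~` (not). The sketch below uses a small made-up passenger table (`people`), not the Titanic data.

```python
import pandas as pd

# Toy passenger table; the values are made up for illustration.
people = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [63.0, 70.0, 26.0, None],
})

# Each condition is a boolean Series with one True/False per row.
is_female = people['Sex'] == 'female'
is_senior = people['Age'] >= 60   # a missing Age compares as False

both = people[is_female & is_senior]      # female AND aged 60 or over
either = people[is_female | is_senior]    # female OR aged 60 or over
not_female = people[~is_female]           # NOT female

print(both)
```

Because each mask is already a complete expression, no extra parentheses are needed when combining them.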

### 1.5 Replacing values

It is common to replace values, for example to rename some classes or to convert text to lower case. To do so in Pandas, you can use the method replace.

In [14]:
# Let's replace a pair of values
df_tc['Sex'].replace(['female','male'],['Woman','Man']).head(5)

0      Man
1    Woman
2    Woman
3    Woman
4      Man
Name: Sex, dtype: object

The replacement returned a modified copy; the original dataset has not changed. We can also find and replace across the entire dataset and not only in a single column:

In [15]:
# We can perform a replacement for all the values 1 into 'One'
df_tc.replace(1,"One").head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,One,0,3,"Braund, Mr. Owen Harris",male,22.0,One,0,A/5 21171,7.25,,S
1,2,One,One,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,One,0,PC 17599,71.2833,C85,C
2,3,One,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
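
The `replace` calls above return new objects and leave `df_tc` untouched. To keep a replacement, one option is to assign the result back to a column. The sketch below shows this on a small made-up frame (`toy`), using a dictionary of old-to-new values; the column name `Sex_label` is hypothetical.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# replace() returns a new Series; the original column is unchanged.
relabelled = toy['Sex'].replace({'female': 'Woman', 'male': 'Man'})

# Assign the result to a column to keep it ('Sex_label' is a made-up name).
toy['Sex_label'] = relabelled

print(toy)
```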


In [16]:
# Now we can use regular expressions
df_tc.replace(r'Mr','Mister', regex=True).tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mister. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mister. Patrick",male,32.0,0,0,370376,7.75,,Q
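
One caveat with the pattern 'Mr' is that it also matches the first two letters of 'Mrs'. A possible refinement, sketched below on two shortened names, is to add word boundaries (`\b`) so that only the standalone title is replaced.

```python
import pandas as pd

# Two shortened passenger names, used only for illustration.
names = pd.Series(['Braund, Mr. Owen Harris',
                   'Cumings, Mrs. John Bradley'])

# The bare pattern also rewrites the 'Mr' inside 'Mrs'.
loose = names.replace(r'Mr', 'Mister', regex=True)

# \b marks a word boundary, so only the standalone 'Mr' matches.
strict = names.replace(r'\bMr\b', 'Mister', regex=True)

print(loose.tolist())
print(strict.tolist())
```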


### 1.6 Renaming columns

We can rename columns using the method 'rename' in pandas.

In [17]:
df_tc.rename(columns={'Pclass':'Passenger class'}).head(2)

Unnamed: 0,PassengerId,Survived,Passenger class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


Notice that we used a dictionary to rename the column. To rename more columns in our dataset, we can add more entries to the dictionary.
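
As a sketch of renaming several columns at once, the dictionary below maps three Titanic column names to longer labels. The frame `toy` and the new labels are made up for illustration. Like `replace`, `rename` returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

# Toy frame with three of the Titanic column names; values are made up.
toy = pd.DataFrame({'Pclass': [3, 1], 'SibSp': [1, 1], 'Parch': [0, 0]})

# One dictionary entry per column to rename: old name -> new name.
renamed = toy.rename(columns={
    'Pclass': 'Passenger class',
    'SibSp': 'Siblings/spouses aboard',
    'Parch': 'Parents/children aboard',
})

print(renamed.columns.tolist())
print(toy.columns.tolist())   # the original names are still in place
```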

### 1.7 Finding the minimum, maximum, sum, average and count

Sometimes we want only specific values of interest, not the whole dataframe or a full summary. For that, we can use the methods max, min, mean, sum and count, and either format the results into strings or pass them on as variables.

In [18]:
# Let's paste some common descriptive statistics
f"For Age variable, max is: {df_tc['Age'].max():.0f}, min is: {df_tc['Age'].min():.0f}, \
mean is: {df_tc['Age'].mean():.2f}, sum of ages is {df_tc['Age'].sum():.0f} and count is {df_tc['Age'].count()}"

'For Age variable, max is: 80, min is: 0, mean is: 29.70, sum of ages is 21205 and count is 714'

In [19]:
# We can even count for all the variables
df_tc.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
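
The counts above differ between columns because `count` skips missing values. The sketch below, on a small made-up Series (`ages`), shows this and also collects several statistics in one call with `agg`.

```python
import pandas as pd

# Toy ages with one missing value; the numbers are made up.
ages = pd.Series([22.0, 38.0, None, 35.0], name='Age')

# Several statistics in one call; the result is a Series indexed by statistic name.
stats = ages.agg(['max', 'min', 'mean', 'sum', 'count'])

# count() ignores missing values, so it can be smaller than the number of rows.
n_rows = len(ages)
n_missing = ages.isna().sum()

print(stats)
print(n_rows, n_missing)
```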

### 1.8 Finding unique values

To get all the unique values of a column, we can use the method "unique" on our target column.

In [20]:
# Let's see the unique values of cabin
df_tc['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',
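
When the list of unique values is long, as it is for 'Cabin', it can help to count them instead of printing them. The sketch below uses a small made-up Series (`cabins`) to show `nunique` and `value_counts` next to `unique`.

```python
import pandas as pd

# Toy cabin column with repeats and missing values; made up for illustration.
cabins = pd.Series(['C85', None, 'C85', 'E46', None], name='Cabin')

distinct = cabins.unique()        # distinct values, including the missing one
n_distinct = cabins.nunique()     # number of distinct values, missing ignored
counts = cabins.value_counts()    # rows per value, missing dropped by default
counts_with_nan = cabins.value_counts(dropna=False)  # also count missing rows

print(n_distinct)
print(counts)
```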

### 2. References

[1] UCI Machine Learning Repository. (May 22, 2021). Banknote Authentication Dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/banknote+authentication  
