# Housekeeping stuff

## What is pandas?

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.

It is also one of the most popular libraries used by data experts from all around the world.

## What can you do with pandas?

Pandas is used for data wrangling, data analysis and data visualisation.

Some examples include creating and merging dataframes, dropping unwanted columns and rows, locating and filling null values, grouping data by category, and creating basic plots such as bar plots, scatter plots and histograms.
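To give a small taste of two of these operations, here is a minimal sketch that fills a null value and groups data by category. The dataframe, its column names and its values are all made up for illustration; the functions used are covered properly later in the series.

```python
import pandas as pd

# hypothetical toy data, invented purely for illustration
sales = pd.DataFrame({
    'category': ['fruit', 'fruit', 'dairy', 'dairy'],
    'amount': [10.0, None, 4.0, 6.0],
})

# locate the null values, then fill them with 0
n_missing = sales['amount'].isnull().sum()
sales['amount'] = sales['amount'].fillna(0)

# group the data by category and add up each group
totals = sales.groupby('category')['amount'].sum()
print(n_missing)
print(totals)
```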

## Why should you learn to use pandas?

As humans interact more and more with technology, vast amounts of data are being generated each day. Hence, the ability to analyse these data and draw insights from them is becoming an increasingly important skill to have in the modern workforce. Organisations are progressively turning to data to help them better understand their customers and products, analyse past trends and patterns, improve operational efficiency and so on.

Here are just some of the many reasons why you should learn pandas:
- By learning pandas, you learn the fundamental ideas behind working with data and pick up some Python coding skills along the way
- It is straightforward to learn and you can immediately apply it to any dataset you want
- It is commonly used in the data science and machine learning community

## Where can you find pandas?

The best way to get access to pandas is by installing Anaconda (https://docs.anaconda.com/anaconda/install/), a distribution of the Python and R programming languages, both of which are heavily used in data science.

By installing Anaconda, you will also have access to Jupyter notebook, which is what I am using to write up this documentation. Jupyter notebook lets you run your Python code cell by cell.

## What do I hope to do with this video series?

This video series is going to be a complete beginner's course on how to use pandas. I don't expect you to have any prior knowledge or background in data science or even programming in general.

Through this video series, I aim to pass on what I have learned about pandas so far and to inspire people to use pandas in their future data analysis work, whether that is for university assignments, side projects or professional work.

On your end, the best way to gain value out of this video series is by doing. Programming is just like driving: you don't learn how to drive merely by reading about it or watching a video of someone else doing it; you have to actually do it yourself. So I highly encourage you to install Jupyter notebook on your computer and have a go at using pandas yourself after you finish watching my weekly content.

# Week 1: Reading csv files & creating your own dataframe

To use pandas, we first have to import the pandas library, which is done as follows:

In [2]:
# import pandas and label it as 'pd'
import pandas as pd

## Reading csv files

For this part of the tutorial, visit https://www.kaggle.com/c/titanic/data to download the file containing the Titanic dataset. Once you have downloaded the file, unzip it, i.e. extract its contents. Keep in mind where the file is on your computer because next we need to specify its location in Jupyter notebook in order to load the data.

In [3]:
# print working directory

pwd

'C:\\Users\\Jason Chong'
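`pwd` is a notebook shortcut rather than plain Python. As a rough equivalent, assuming only the standard library, `os.getcwd()` returns the same working directory, and `os.path.join` can build the full path to a file from it. The folder and file names below are placeholders for wherever you saved the data.

```python
import os

# current working directory, the plain-Python counterpart of 'pwd'
cwd = os.getcwd()
print(cwd)

# build a full path to a file relative to the working directory
# ('titanic' and 'train.csv' are placeholder names for wherever you saved the data)
csv_path = os.path.join(cwd, 'titanic', 'train.csv')
print(csv_path)
```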

In [8]:
# read data via 'pd.read_csv'

train = pd.read_csv('C:/Users/Jason Chong/Documents/Kaggle/titanic/train.csv')
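# note: 'test' is also read from train.csv here, so it holds the same rows as 'train';
# use test.csv instead to load the actual test set
test = pd.read_csv('C:/Users/Jason Chong/Documents/Kaggle/titanic/train.csv')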
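If you don't have the Titanic file to hand yet, you can still try `pd.read_csv` on any CSV text. The sketch below reads from an in-memory string through `io.StringIO` instead of a file path; the three rows are made up and only borrow a few of the Titanic column names.

```python
import io
import pandas as pd

# made-up CSV text standing in for a file on disk
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,38
3,1,
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Notice that the empty `Age` field in the last row is read in as a missing value.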

Let's have a look at our datasets.

In [9]:
# 'head' shows the first five rows of the dataframe by default, but you can specify the number of rows in the parentheses

train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
# 'tail' shows the bottom five rows by default

test.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
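Both `head` and `tail` accept an optional number of rows. Here is a small sketch on a made-up dataframe (not the Titanic data) so you can see the effect at a glance:

```python
import pandas as pd

# made-up dataframe with six rows (not the Titanic data)
numbers = pd.DataFrame({'n': [10, 20, 30, 40, 50, 60]})

# no argument: five rows
first_five = numbers.head()

# with an argument: exactly that many rows
first_two = numbers.head(2)
last_two = numbers.tail(2)

print(first_two)
print(last_two)
```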


In [13]:
# the 'shape' attribute tells us how many rows and columns exist in a dataframe

train.shape

(891, 12)
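Because `shape` is a plain tuple of (rows, columns), you can unpack it into two variables. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# made-up dataframe: 3 rows, 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() gives the number of rows only
print(len(df))
```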

## Creating your own dataframe

In [59]:
# number entries

test_scores = pd.DataFrame({'Science': [50, 75, 31], 'Geography': [88, 100, 66], 'Math': [72, 86, 94]}, columns = ['Geography', 'Science', 'Math'])
test_scores

| | Geography | Science | Math |
|---|---|---|---|
| 0 | 88 | 50 | 72 |
| 1 | 100 | 75 | 86 |
| 2 | 66 | 31 | 94 |
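In the cell above, the dictionary keys become the column names, and the `columns` argument only decides the order in which they appear. A sketch of the difference, reusing the same scores:

```python
import pandas as pd

scores = {'Science': [50, 75, 31], 'Geography': [88, 100, 66], 'Math': [72, 86, 94]}

# without 'columns', the columns follow the order of the dictionary keys
default_order = pd.DataFrame(scores)

# with 'columns', we choose the order ourselves
custom_order = pd.DataFrame(scores, columns = ['Geography', 'Science', 'Math'])

print(list(default_order.columns))
print(list(custom_order.columns))
```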


In [61]:
# text entries
# we can set the row labels via the 'index' argument

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 'Emily': ['It was too sweet', 'Yum!']},
                     index = ['Product A', 'Product B'])
survey

| | James | Emily |
|---|---|---|
| Product A | I liked it | It was too sweet |
| Product B | It could use a bit more salt | Yum! |
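One reason to give rows meaningful labels is that you can then look things up by name. The sketch below uses `.loc`, which is not covered in this week's material, so treat it as a preview rather than something you need to know yet.

```python
import pandas as pd

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'],
                       'Emily': ['It was too sweet', 'Yum!']},
                      index = ['Product A', 'Product B'])

# look up a whole row by its index label
row_a = survey.loc['Product A']

# look up a single cell by row label and column name
emily_b = survey.loc['Product B', 'Emily']
print(emily_b)
```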


## Rename and reset index

In [57]:
# reset index
# try playing around with 'drop' and 'inplace' and see what they do

test_scores.reset_index(drop = True, inplace = True)
test_scores

| | Geography | Science | Math |
|---|---|---|---|
| 0 | 88 | 50 | 72 |
| 1 | 100 | 75 | 86 |
| 2 | 66 | 31 | 94 |
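If you want a head start on playing around with `drop` and `inplace`, here is a minimal sketch on a made-up dataframe with text labels as its index, which makes the effect easier to see than with the default 0, 1, 2 index:

```python
import pandas as pd

df = pd.DataFrame({'score': [88, 66]}, index = ['x', 'y'])

# drop = False (the default): the old index is kept as a new column called 'index'
kept = df.reset_index()

# drop = True: the old index is thrown away
dropped = df.reset_index(drop = True)

# neither call used inplace = True, so df itself is unchanged
print(list(kept.columns))
print(list(dropped.columns))
print(list(df.index))
```

With `inplace = True`, the change would be applied to `df` directly and nothing would be returned.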


In [58]:
# rename columns
# suppose we want to change the name of the first two columns

test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'}, inplace = True)
test_scores

# Alternatively, this also works
# test_scores = test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'})

| | Physics | Arts | Math |
|---|---|---|---|
| 0 | 88 | 50 | 72 |
| 1 | 100 | 75 | 86 |
| 2 | 66 | 31 | 94 |
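To make the difference between the two styles concrete, the sketch below calls `rename` without `inplace = True` and checks that the original dataframe is left alone. It also shows that the same method can relabel rows through its `index` argument. The small dataframe is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Geography': [88, 100], 'Science': [50, 75]})

# without inplace = True, rename returns a new dataframe and leaves df untouched
renamed = df.rename(columns = {'Geography': 'Physics'})

# the same method can relabel rows through the 'index' argument
relabelled = df.rename(index = {0: 'first', 1: 'second'})

print(list(df.columns))
print(list(renamed.columns))
print(list(relabelled.index))
```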


## Dropping columns and rows

In [62]:
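# dropping columns
# (this cell was run on a freshly created test_scores, so the original column names appear in the output)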

test_scores = test_scores.drop(columns = 'Math')
test_scores

| | Geography | Science |
|---|---|---|
| 0 | 88 | 50 |
| 1 | 100 | 75 |
| 2 | 66 | 31 |
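You can also drop several columns in one go by passing a list of names. A minimal sketch, using a fresh copy of the scores:

```python
import pandas as pd

df = pd.DataFrame({'Geography': [88, 100, 66], 'Science': [50, 75, 31], 'Math': [72, 86, 94]})

# pass a list to drop several columns at once
only_geo = df.drop(columns = ['Science', 'Math'])
print(list(only_geo.columns))
```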


In [63]:
# dropping rows

test_scores.drop(1, inplace = True)
test_scores

| | Geography | Science |
|---|---|---|
| 0 | 88 | 50 |
| 2 | 66 | 31 |
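Notice that dropping a row leaves a gap in the index (0, then 2). The sketch below, on a fresh copy of the scores, shows how `reset_index(drop = True)` from earlier renumbers the rows, and how a list of labels drops several rows at once.

```python
import pandas as pd

df = pd.DataFrame({'Geography': [88, 100, 66], 'Science': [50, 75, 31]})

# drop the row labelled 1; the remaining labels are 0 and 2
remaining = df.drop(1)
print(list(remaining.index))

# reset_index(drop = True) renumbers the rows from 0 again
renumbered = remaining.reset_index(drop = True)
print(list(renumbered.index))

# several rows can be dropped by passing a list of labels
first_only = df.drop([1, 2])
print(first_only)
```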


## Series

There are two core objects in pandas: the dataframe, which we have already gone through, and the series.

A dataframe, as we have seen, looks like a data table. A series, on the other hand, is a sequence of data values, similar to a list.

In [48]:
pd.Series([0, 1, 2, 3, 4])

0    0
1    1
2    2
3    3
4    4
dtype: int64
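A series is similar to a Python list but not identical to one: it carries an index and a data type alongside its values. A minimal sketch of going from a list to a series and back:

```python
import pandas as pd

values = [0, 1, 2, 3, 4]
s = pd.Series(values)

# unlike a list, a series has an index and a single dtype
print(list(s.index))
print(s.dtype)

# converting back to a plain list
print(s.tolist())
```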

You can think of a series as a single column within a dataframe, so we can assign index labels to a series just as we would with a dataframe.

In [49]:
profit = pd.Series([75, 26, 38], index = ['2018 Profit', '2019 Profit', '2020 Profit'])
profit

2018 Profit    75
2019 Profit    26
2020 Profit    38
dtype: int64
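Two quick checks of the idea that a series behaves like a single labelled column: values can be looked up by their index label, and selecting one column from a dataframe hands you back a series. The small dataframe here is a fresh copy of the earlier scores.

```python
import pandas as pd

profit = pd.Series([75, 26, 38], index = ['2018 Profit', '2019 Profit', '2020 Profit'])

# access a value by its index label
print(profit['2019 Profit'])

# selecting one column of a dataframe gives back a series
df = pd.DataFrame({'Geography': [88, 100, 66], 'Science': [50, 75, 31]})
geography = df['Geography']
print(type(geography))
```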

Using this same logic, we can form a dataframe from a list of lists, where each inner list plays the role of a series. Let's see how we can do that.

In [53]:
customer_sales = pd.DataFrame([[317, 'Melbourne', '80'], [887, 'New York', '91'], [225, 'London', '50']], columns = ['Customer_ID', 'City', 'Sales'])
customer_sales

| | Customer_ID | City | Sales |
|---|---|---|---|
| 0 | 317 | Melbourne | 80 |
| 1 | 887 | New York | 91 |
| 2 | 225 | London | 50 |


Unlike before, when we were creating our dataframe column by column, here each inner list corresponds to a single row in the dataframe.
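To see the two approaches side by side, the sketch below builds the same small dataframe once by row (list of lists) and once by column (dictionary), then checks that the results match. It uses a cut-down version of the customer data with only two columns.

```python
import pandas as pd

# by row: each inner list is one row, and 'columns' names the fields
by_row = pd.DataFrame([[317, 'Melbourne'], [887, 'New York']],
                      columns = ['Customer_ID', 'City'])

# by column: each dictionary key is one column
by_column = pd.DataFrame({'Customer_ID': [317, 887],
                          'City': ['Melbourne', 'New York']})

# compare the two dataframes
print(by_row.equals(by_column))
```

Which one you pick usually depends on how your data arrives: record by record, or field by field.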