# ISE224 Lecture Note 7: Important Packages of Python

**Topics:**   
Introduction to pandas

---

### What is pandas?

Pandas is a popular open-source data analysis and manipulation library for the Python programming language. It provides a powerful set of tools for working with **structured data**, including `data frames` (similar to spreadsheets or database tables) and `series` (one-dimensional labeled arrays). Pandas is widely used in data science, machine learning, finance, and other fields where large amounts of data need to be processed, analyzed, and visualized. It offers a wide range of features for data cleaning, merging, reshaping, slicing, indexing, filtering, and grouping, as well as statistical and time series analysis. 

It is also one of the most popular libraries used by data experts from all around the world.

### Why should you learn to use pandas?

- Pandas provides a powerful and flexible set of tools for data cleaning, manipulation, and analysis, making it easier and faster to work with large datasets  
- Pandas is open-source and free to use, making it accessible to anyone who wants to learn and use it  
- Pandas integrates well with other Python libraries commonly used in data science, such as NumPy, Matplotlib, and Scikit-learn, allowing you to build end-to-end data pipelines and machine learning models  
- Pandas offers a wide range of functions and methods for handling missing data, reshaping data, merging and joining datasets, grouping and aggregating data, and performing statistical and time series analysis  
- Pandas allows you to visualize and explore data in various formats, such as tables, charts, and graphs, making it easier to communicate insights to others.

In summary, learning pandas can significantly improve your data analysis skills and help you excel in your career, whether you are a data scientist, a business analyst, a researcher, or anyone who works with data.

### Package installation

If you haven't installed the `pandas` package on your computer, you have to install it first. You only need to install it once.

Type the following command in your **terminal** or **command prompt** (not in the Python console) to install the `pandas` package.

```
python -m pip install pandas
```

If you are using macOS and the previous command doesn't work, try the following command in the **terminal**:
```
python3 -m pip install pandas
```

### Import package

To use pandas, you have to import the `pandas` library first.

In [1]:
# Import pandas and label it as 'pd'

import pandas as pd
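
A quick way to confirm that the installation and the import both worked is to print the installed version. This is only a small sanity check; the version number you see depends on your own installation.

```python
import pandas as pd

# If this prints a version string, pandas is installed and imported correctly
print(pd.__version__)
```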

### Reading csv files

You will need to download the [titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle, or you can download it from Blackboard: [titanic](https://github.com/cxc1920/ISE224/blob/main/pictures/titanic.zip). Once you have downloaded the file, unzip it, i.e. extract its contents. Keep in mind where the files are on your computer, because we need to specify their location in Jupyter Notebook in order to load the data.

##### - `pd.read_csv`: Read data

In [2]:
# Read data via 'pd.read_csv'
# Use the appropriate read function for different file formats, for example pd.read_excel allows you to import files in excel format

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
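
If you do not have the Titanic files at hand yet, the sketch below shows what `pd.read_csv` does using a few lines of CSV text held in memory. The CSV content and the names `csv_text` and `demo` are made up for illustration; they are not part of the Titanic data.

```python
import io
import pandas as pd

# Hypothetical CSV content, made up for illustration (not the Titanic data)
csv_text = """id,name,score
1,Ann,90.5
2,Ben,72.0
3,Cara,
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))

# The first line becomes the column names; the empty score becomes a missing value
print(demo)
print(demo.dtypes)
```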

Use `pandas_obj.head()` or `pandas_obj.tail()` to check the first (or last) 5 rows of the dataset.
##### - `pandas_obj.head()`  
##### - `pandas_obj.tail()`  

In [3]:
# 'head' shows the first five rows of the dataframe by default but you can specify the number of rows in the parenthesis

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [5]:
# 'tail' shows the bottom five rows by default

test.tail()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


In [6]:
test.tail(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C
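
The same behaviour can be checked on a tiny dataframe where the result is easy to verify by eye. This is a minimal sketch; the dataframe `numbers` is made up for illustration.

```python
import pandas as pd

# Hypothetical ten-row dataframe, made up for illustration
numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()     # default: first 5 rows
first_two = numbers.head(2)     # first 2 rows
last_three = numbers.tail(3)    # last 3 rows

print(first_five)
print(first_two)
print(last_three)
```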


### Read Excel file

Download this [sample.xlsx](https://github.com/cxc1920/ISE224/blob/main/pictures/sample.xlsx) and put it in your current working directory.

In [7]:
sample = pd.read_excel('sample.xlsx')
sample

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,virginica
146,147,6.3,2.5,5.0,1.9,virginica
147,148,6.5,3.0,5.2,2.0,virginica
148,149,6.2,3.4,5.4,2.3,virginica


##### - `pandas_obj.shape`: get the number of rows and columns of the data

In [8]:
# The 'shape' attribute tells us how many rows and columns exist in a dataframe

train.shape

(891, 12)
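
Because `shape` is a tuple of the form (rows, columns), it can be unpacked into two variables. A minimal sketch on a small made-up dataframe (the name `scores` is only for illustration):

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
scores = pd.DataFrame({'Science': [50, 75, 31], 'Math': [72, 86, 94]})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = scores.shape
print(n_rows, n_cols)

# len() of a dataframe gives the number of rows
print(len(scores))
```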

### Create pandas data frame object

#### - `pandas.DataFrame()`: Creating your own dataframe

In [9]:
# Number entries: by column

test_scores = pd.DataFrame({'Student_ID': [154, 973, 645], 'Science': [50, 75, 31], 'Geography': [88, 100, 66],
                            'Math': [72, 86, 94]})
test_scores

Unnamed: 0,Student_ID,Science,Geography,Math
0,154,50,88,72
1,973,75,100,86
2,645,31,66,94


In [10]:
# Number entries: by row

test_scores = pd.DataFrame([[154, 50, 88, 72], [973, 75, 100, 86],[645, 31, 66, 94]],
                           columns = ['Student_ID','Science','Geography','Math'] )
test_scores

Unnamed: 0,Student_ID,Science,Geography,Math
0,154,50,88,72
1,973,75,100,86
2,645,31,66,94


In [11]:
# Text entries

survey = pd.DataFrame({'James': ['I liked this dish.', 'It could use a bit more salt'], 
                       'Emily': ['It is too sweet', 'Yum!']})
survey

Unnamed: 0,James,Emily
0,I liked this dish.,It is too sweet
1,It could use a bit more salt,Yum!


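#### Row index

We can either set an existing column as our index or specify an index when creating a dataframe.

Let's begin by setting an existing column as the index. The dataframe below starts with the row labels 'a', 'b', 'c'; `set_index` then replaces them with the `Student_ID` column.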

In [12]:
# Number entries: by column

test_scores = pd.DataFrame({'Student_ID': [154, 973, 645], 'Science': [50, 75, 31], 'Geography': [88, 100, 66],
                            'Math': [72, 86, 94]},
                          index = ['a','b','c'])
test_scores

Unnamed: 0,Student_ID,Science,Geography,Math
a,154,50,88,72
b,973,75,100,86
c,645,31,66,94


In [13]:
test_scores = test_scores.set_index('Student_ID')
test_scores

Student_ID,Science,Geography,Math
154,50,88,72
973,75,100,86
645,31,66,94


Alternatively, we can specify an index column when creating a dataframe via the 'index' argument.

In [14]:
survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 'Emily': ['It is too sweet', 'Yum!']},
                     index = ['Product A', 'Product B'])
survey

Unnamed: 0,James,Emily
Product A,I liked it,It is too sweet
Product B,It could use a bit more salt,Yum!


##### - `pandas_obj.reset_index(drop, inplace)`: Reset the index back to its default.

`inplace=True` is a parameter in pandas that is used to modify a DataFrame or Series in place, *without creating a new object*. When you set inplace=True, the changes you make to the DataFrame or Series are made directly to the original object, instead of creating a copy of the object with the changes applied.

In [15]:
# Reset index
# Try playing around with 'drop' and 'inplace' and see what they do

survey.reset_index(drop = True, inplace = True)
survey

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
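
To see what the `drop` argument changes, the sketch below calls `reset_index` twice on a small made-up dataframe (`reviews` is a hypothetical name), without `inplace`, so the original stays untouched.

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
reviews = pd.DataFrame({'James': ['I liked it', 'Meh']},
                       index=['Product A', 'Product B'])

# drop=False (the default): the old index is kept as a new column named 'index'
kept = reviews.reset_index()

# drop=True: the old index is discarded
dropped = reviews.reset_index(drop=True)

print(kept)
print(dropped)
```

Since `inplace` was not set, `reviews` still has its original 'Product A' and 'Product B' labels after both calls.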


#### Renaming columns 

In [16]:
# Suppose we want to change the names of the first two columns, inplace = False

new_test = test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'}, inplace = False)
new_test

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94


In [17]:
test_scores

Student_ID,Science,Geography,Math
154,50,88,72
973,75,100,86
645,31,66,94


In [18]:
# Suppose we want to change the names of the first two columns, inplace = True

new_test2 = test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'}, inplace = True)
new_test2 # None

In [19]:
test_scores

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94


#### Dropping columns and rows

There are a few ways to drop columns or rows from your dataframe. In this example, I am only focusing on the `drop` function.

In [20]:
# Drop the 'Math' column

test_scores.drop(columns = 'Math')

Student_ID,Arts,Physics
154,50,88
973,75,100
645,31,66


In [21]:
# Drop row with student_ID 973
# We can make this more robust once we learn the 'loc' function in the coming weeks 
 
test_scores.drop(973)

Student_ID,Arts,Physics,Math
154,50,88,72
645,31,66,94
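
Note that `drop` returns a new dataframe and leaves the original unchanged unless you reassign the result or pass `inplace=True`. A minimal sketch on a made-up copy of the scores table (the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
scores = pd.DataFrame({'Arts': [50, 75, 31], 'Physics': [88, 100, 66], 'Math': [72, 86, 94]},
                      index=[154, 973, 645])

# drop returns a new dataframe; 'scores' itself is unchanged
no_math = scores.drop(columns='Math')

# Several row labels can be passed as a list
two_gone = scores.drop(index=[154, 645])

print(no_math)
print(two_gone)
print(scores.shape)  # still 3 rows and 3 columns
```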


#### Adding columns and rows

In [22]:
test_scores

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94


In [23]:
# Create a new column for history subject

test_scores['History'] = [79, 70, 67]
test_scores

Student_ID,Arts,Physics,Math,History
154,50,88,72,79
973,75,100,86,70
645,31,66,94,67


In [24]:
# add new row to DataFrame
test_scores.loc[len(test_scores.index)]=[88,80,99,98]

In [25]:
test_scores

Student_ID,Arts,Physics,Math,History
154,50,88,72,79
973,75,100,86,70
645,31,66,94,67
3,88,80,99,98
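
In the output above the new row received the label 3, because `len(test_scores.index)` was 3 at that moment. When the index holds meaningful labels such as student IDs, it may be clearer to assign to an explicit label instead. A minimal sketch, where the ID 802 and the dataframe `scores` are made up for illustration:

```python
import pandas as pd

# Hypothetical dataframe indexed by student ID, made up for illustration
scores = pd.DataFrame({'Arts': [50, 75], 'Math': [72, 86]},
                      index=pd.Index([154, 973], name='Student_ID'))

# Assigning to a label that does not exist yet appends a new row.
# 802 is a made-up student ID used only for illustration.
scores.loc[802] = [88, 99]

print(scores)
```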


In [26]:
# Add more product reviews from James and Emily
# Recall our survey dataframe

survey

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!


In [27]:
# Create two more rows

df = pd.DataFrame({'James': ['Not good', 'Meh'], 'Emily': ['My grandma can cook better', 'Pretty average']})
df

Unnamed: 0,James,Emily
0,Not good,My grandma can cook better
1,Meh,Pretty average


In [28]:
# Use the 'pd.concat' function

survey_cat1 = pd.concat([survey, df],axis=0)
survey_cat1

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
0,Not good,My grandma can cook better
1,Meh,Pretty average


#### - `pd.concat()`: concatenate data frames

In [29]:
# Use the 'pd.concat' function with ignore_index=True

survey_cat1 = pd.concat([survey, df],axis=0,ignore_index=True)
survey_cat1

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
2,Not good,My grandma can cook better
3,Meh,Pretty average


In [30]:
# Use the 'pd.concat' function with axis = 1

survey_cat2 = pd.concat([survey, df],axis=1)
survey_cat2

Unnamed: 0,James,Emily,James,Emily
0,I liked it,It is too sweet,Not good,My grandma can cook better
1,It could use a bit more salt,Yum!,Meh,Pretty average


In [31]:
# Use the 'pd.concat' function with axis = 1, ignore_index=True

survey_cat2 = pd.concat([survey, df],axis=1,ignore_index=True)
survey_cat2

Unnamed: 0,0,1,2,3
0,I liked it,It is too sweet,Not good,My grandma can cook better
1,It could use a bit more salt,Yum!,Meh,Pretty average


### Series

Pandas has two main data structures: **dataframe** and **series**.

A **dataframe** is a two-dimensional table-like structure that can store data of different types in columns. It is similar to a spreadsheet. We can think of a dataframe as a collection of series, where each column represents a series of data.

On the other hand, a **series** is a one-dimensional labeled array that can store data of any type (numeric, string, boolean, etc.). It is similar to a Python **list** or an **array**, but with labels or indices that can be used to access or manipulate the data.

In summary, while a dataframe is a collection of series arranged in a tabular format, a series is simply a labeled sequence of data values.

In [32]:
# pandas 'series' object

pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [33]:
# 'series' with index

profit = pd.Series([75, 80, 66], index = ['2018 Profit', '2019 Profit', '2020 Profit'])
profit

2018 Profit    75
2019 Profit    80
2020 Profit    66
dtype: int64
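
The labels of a series can be used to look up values, and a single column of a dataframe is itself a series. A minimal sketch reusing the `profit` series from above; the dataframe `table` is made up for illustration.

```python
import pandas as pd

profit = pd.Series([75, 80, 66], index=['2018 Profit', '2019 Profit', '2020 Profit'])

# Access a value by its label
print(profit['2019 Profit'])

# Access a value by its position
print(profit.iloc[0])

# A single dataframe column is itself a series
table = pd.DataFrame({'Science': [50, 75, 31]})
print(type(table['Science']))
```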

Using this same logic, we can form a dataframe from a list of lists, where each inner list plays the role of one row. Let's see how we can do that.

In [34]:
customer_sales = pd.DataFrame([[317, 'Melbourne', '80'], 
                               [887, 'New York', '91'], 
                               [225, 'London', '50']], columns = ['Customer_ID', 'City', 'Sales'])
customer_sales

Unnamed: 0,Customer_ID,City,Sales
0,317,Melbourne,80
1,887,New York,91
2,225,London,50


Unlike before, when we created our dataframe by column, here each inner list corresponds to a single row in the dataframe.
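
One detail worth noticing in the example above: the sales figures were entered in quotes, so pandas stores them as text rather than numbers. The sketch below checks the column types and converts `Sales` to integers with `astype`; whether you want this conversion depends on what you plan to do with the column.

```python
import pandas as pd

customer_sales = pd.DataFrame([[317, 'Melbourne', '80'],
                               [887, 'New York', '91'],
                               [225, 'London', '50']],
                              columns=['Customer_ID', 'City', 'Sales'])

# Quoted numbers are stored as text, so 'Sales' is not numeric yet
print(customer_sales.dtypes)

# Convert the column to integers before doing arithmetic on it
customer_sales['Sales'] = customer_sales['Sales'].astype(int)
print(customer_sales['Sales'].sum())
```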

### Selecting data using `loc` and `iloc`

In [35]:
data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Selecting a series/column in a dataframe

There are two ways you can select a column of a dataframe.

1. `pandas_obj.Name`  
2. `pandas_obj['Name']`  

What is the difference between the two? They both do the same thing, but the second one is more robust. For example, if I rename the 'PassengerId' column to 'Passenger ID', `data.Passenger ID` would not work because of the space, while `data['Passenger ID']` still does.

Let's see it in action.

In [36]:
# Let's first try it out on the Name feature

data.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [37]:
data['Sex']

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

So both ways return the requested column without any issues.

In [38]:
# Rename 'PassengerId' column (we covered this in our tutorial last week)

data.rename(columns = {'PassengerId': 'Passenger ID'}, inplace = True)
data.head()

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Avoid using space in the column name

In [39]:
data['Passenger ID']

0        1
1        2
2        3
3        4
4        5
      ... 
886    887
887    888
888    889
889    890
890    891
Name: Passenger ID, Length: 891, dtype: int64
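
Spaces are not the only case where the dot notation falls short. If a column happens to share its name with an existing dataframe method, the dot notation returns the method rather than the column. A minimal sketch, using a made-up dataframe `people` with a hypothetical column called `count`:

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
people = pd.DataFrame({'Passenger ID': [1, 2], 'count': [10, 20]})

# Bracket notation works for any column name, including names with spaces
print(people['Passenger ID'])

# Attribute access clashes with existing dataframe methods:
# people.count is the built-in count method, not the 'count' column
print(callable(people.count))
print(people['count'])
```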

### Index-based selection 

We use `iloc` to select data based on its numerical position in the dataframe.

`iloc` takes two arguments: the row position followed by the column position. Positions start at 0, that is, 0 is the first, 1 is the second, 2 is the third, and so on.

In [40]:
# First row and all columns

data.iloc[0, :]

Passenger ID                          1
Survived                              0
Pclass                                3
Name            Braund, Mr. Owen Harris
Sex                                male
Age                                22.0
SibSp                                 1
Parch                                 0
Ticket                        A/5 21171
Fare                               7.25
Cabin                               NaN
Embarked                              S
Name: 0, dtype: object

In [41]:
# Fourth column that is the Name column and all rows
# Since starting index is 0 fourth column corresponds to index number 3

data.iloc[:, 3]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Suppose we want to select a range of values.

`iloc` includes the first number but excludes the last number of the range. For example, if we want the second and third rows of the first column, the code is as follows:

In [42]:
# Second and third rows of the first column

data.iloc[1:3, 0]

1    2
2    3
Name: Passenger ID, dtype: int64

In [43]:
# First three rows and all columns

data.iloc[[0, 1, 2], :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [44]:
# Bottom five rows of the dataframe

data.iloc[-5:, :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [45]:
# This is the same as using the tail function

data.tail()

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
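
To make clear that `iloc` works on positions rather than labels, the sketch below uses a small made-up dataframe whose row labels are letters (the name `scores` is only for illustration). The positions 0, 1, 2 still work even though no row is labelled with a number.

```python
import pandas as pd

# Hypothetical dataframe with letter labels, made up for illustration
scores = pd.DataFrame({'Arts': [50, 75, 31], 'Math': [72, 86, 94]},
                      index=['a', 'b', 'c'])

# Row position 0, column position 1 -> a single value
print(scores.iloc[0, 1])

# Rows at positions 1 and 2 (the end position 3 is excluded), all columns
print(scores.iloc[1:3, :])

# Negative positions count from the end: the last row
print(scores.iloc[-1, :])
```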


### Label-based selection

With `loc`, we select rows and columns by their labels, so we need to specify the actual name of the column.

In [46]:
# First row of the Name column

data.loc[0, 'Name']

'Braund, Mr. Owen Harris'

Unlike `iloc`, when we select a range of values, `loc` includes both the start and the end of the range.

For example, to get the first 5 rows with `iloc` we would write `data.iloc[:5]`, whereas with `loc` we write `data.loc[:4]` instead.

In [47]:
# First 5 rows of the Name, Sex and Age column

data.loc[:4, ['Name', 'Sex', 'Age']]

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
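
The difference between the two slicing rules is easiest to see side by side. A minimal sketch on a made-up dataframe with letter labels (`letters` is a hypothetical name):

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
letters = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# loc slices by label and includes the end label 'c'
by_label = letters.loc['a':'c', 'value']

# iloc slices by position and excludes the end position 2
by_position = letters.iloc[0:2, 0]

print(by_label)
print(by_position)
```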


### Conditional Selection

We can select rows that satisfy certain conditions. In this section, we will look at how that works.

In [48]:
# Rows with age 50

data.loc[data['Age'] == 50, :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
177,178,0,1,"Isham, Miss. Ann Elizabeth",female,50.0,0,0,PC 17595,28.7125,C49,C
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
434,435,0,1,"Silvey, Mr. William Baird",male,50.0,1,0,13507,55.9,E44,S
458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
482,483,0,3,"Rouse, Mr. Richard Henry",male,50.0,0,0,A/5 3594,8.05,,S
526,527,1,2,"Ridsdale, Miss. Lucy",female,50.0,0,0,W./C. 14258,10.5,,S
544,545,0,1,"Douglas, Mr. Walter Donald",male,50.0,1,0,PC 17761,106.425,C86,C
660,661,1,1,"Frauenthal, Dr. Henry William",male,50.0,2,0,PC 17611,133.65,,S
723,724,0,2,"Hodges, Mr. Henry Price",male,50.0,0,0,250643,13.0,,S


In [49]:
# Rows with age 50 AND female
# This is a subset of the above dataframe, keeping only the females

data.loc[(data['Age'] == 50) & (data['Sex'] == 'female') ,:]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
177,178,0,1,"Isham, Miss. Ann Elizabeth",female,50.0,0,0,PC 17595,28.7125,C49,C
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
526,527,1,2,"Ridsdale, Miss. Lucy",female,50.0,0,0,W./C. 14258,10.5,,S


In [50]:
# Rows with age 50 OR have fare greater than or equal to 200

data.loc[(data['Age'] == 50) | (data['Fare'] >= 200), :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
177,178,0,1,"Isham, Miss. Ann Elizabeth",female,50.0,0,0,PC 17595,28.7125,C49,C
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
311,312,1,1,"Ryerson, Miss. Emily Borie",female,18.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
377,378,0,1,"Widener, Mr. Harry Elkins",male,27.0,0,2,113503,211.5,C82,C
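
Behind the scenes, each condition produces a series of `True`/`False` values (often called a mask), and `loc` keeps the rows where the mask is `True`. The sketch below makes the mask visible on a small made-up dataframe (`people` is a hypothetical name).

```python
import pandas as pd

# Hypothetical dataframe, made up for illustration
people = pd.DataFrame({'Age': [50, 23, 50, 35],
                       'Sex': ['female', 'male', 'male', 'female']})

# A comparison on a column yields a boolean series, one True/False per row
is_fifty = people['Age'] == 50
print(is_fifty)

# Masks combine with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses because & and | bind more tightly than ==
fifty_and_female = people.loc[is_fifty & (people['Sex'] == 'female'), :]
not_fifty = people.loc[~is_fifty, :]

print(fifty_and_female)
print(not_fifty)
```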


In [51]:
# All the rows with null cabin column

data.loc[data['Cabin'].isnull(), :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


The opposite of the `isnull` function is the `notnull` function, which returns `True` for every value that is not null, so it can be used to select the rows without missing values.
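
A minimal sketch of `notnull` on a small made-up dataframe (the names and cabin values in `cabins` are hypothetical and only serve the illustration):

```python
import pandas as pd

# Hypothetical dataframe with one missing cabin, made up for illustration
cabins = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Cabin': ['C85', None, 'E46']})

# notnull() marks the rows whose value is present
has_cabin = cabins['Cabin'].notnull()
print(has_cabin)

# Use the mask to keep only rows with a known cabin
print(cabins.loc[has_cabin, :])
```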

In [52]:
# All rows with C or Q in Embarked column

data.loc[data['Embarked'].isin(['C', 'Q']), :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [53]:
# This is the same as if we had used the or operator (|)

data.loc[(data['Embarked'] == 'C') | (data['Embarked'] == 'Q'), :]

Unnamed: 0,Passenger ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Write data to CSV/Excel file

##### - `pandas_obj.to_csv('CSVfilename.csv', index=True/False)`: Write pandas_obj to CSVfilename.csv

In [54]:
# create a sample dataframe
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'Age': [25, 30, 35, 40],
        'Country': ['USA', 'Canada', 'UK', 'Australia']})

# write the dataframe to a CSV file
df.to_csv('pandas_data.csv', index=False)

##### - `pandas_obj.to_excel('Excelfilename.xlsx', index=True/False)`: Write pandas_obj to Excelfilename.xlsx

In [55]:
# write the dataframe to an Excel file
df.to_excel('pandas_data.xlsx', index=False)

In [56]:
# write the dataframe to an Excel file, including the index
df.to_excel('pandas_data_idx_True.xlsx', index=True)
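
To see what the `index` argument changes, one option is to write the same dataframe twice and read both files back. The sketch below does this with CSV files in a temporary folder, so nothing is left on disk; the file names and variable names are made up for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write into a temporary folder so no files are left behind
with tempfile.TemporaryDirectory() as folder:
    path_no_index = os.path.join(folder, 'no_index.csv')
    path_with_index = os.path.join(folder, 'with_index.csv')

    df.to_csv(path_no_index, index=False)
    df.to_csv(path_with_index, index=True)

    # Reading the files back shows the difference
    back_no_index = pd.read_csv(path_no_index)
    back_with_index = pd.read_csv(path_with_index)

print(back_no_index.columns.tolist())
print(back_with_index.columns.tolist())
```

With `index=True`, the row labels are written as an extra first column without a header, which `read_csv` then reports as `Unnamed: 0`.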