## Pandas

Pandas contains data structures and data manipulation tools designed for data cleaning and analysis. While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

The name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.

1. **[Pandas Series](#series)**
    - 1.1 - [Creating a Series](#creatingS)
    - 1.2 - [Manipulating Series](#manipulatingS)
2. **[Pandas DataFrames](#dataframes)**
    - 2.1 - [Creating DataFrames](#creatingDF)
    - 2.2 - [Manipulating DataFrames](#manipulatingDF)
3. **[Pandas Basics](#pandas)**
4. **[Boolean Masking](#masking)**
5. **[Grouping and Aggregation](#grouping)**


In [1]:
# Import the pandas library as pd
import pandas as pd

<a id="series"> </a>
## 1. Pandas Series

Pandas has two main data structures:<br>
1. A Series is a one-dimensional labeled array that can hold data of any type (integer, string, boolean, float, Python objects, and so on). Its axis labels are collectively called an index.<br>
2. A DataFrame is a two-dimensional labeled data structure with columns. Different columns can hold different data types.

#### Introduction to Pandas Series and Creating Series

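A pandas Series is a one-dimensional labeled array, similar to an array, list, or column in a table. Every Series has a single data type (dtype): when all values share one type, such as integers, the Series is homogeneous, and when the values are of mixed types they are stored under the general-purpose `object` dtype.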

It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.
<br>



<a id="creatingS"> </a>
#### 1.1 Creating a Series

**1. To create a numeric series** 

In [8]:
# Create a numeric series
numbers = range(1,50,5)
pd.Series(numbers)

0     1
1     6
2    11
3    16
4    21
5    26
6    31
7    36
8    41
9    46
dtype: int64

The output also reports the data type of the Series, here `int64`. Note that, by default, each item receives an index label from 0 to N, where N is the length of the Series minus one.
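
The default index and the inferred dtype can also be inspected directly. The following is a minimal sketch; the name `squares` and its values are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative data; not taken from the notebook above.
squares = pd.Series([1, 4, 9, 16])

# With no explicit index, pandas assigns the labels 0, 1, ..., len - 1.
print(list(squares.index))  # [0, 1, 2, 3]

# The dtype is inferred from the values (an integer dtype here).
print(squares.dtype)
```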

**2. To create an object series** 

In [3]:
# Create an object series
string = "Hi", "How", "are", "you", "?"
print(pd.Series(string))

# Create a Series with an arbitrary list using both numeric and string values 
s = pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])
print("\n", s)

0     Hi
1    How
2    are
3    you
4      ?
dtype: object

 0               345
1            London
2              34.5
3            -34.45
4    Happy Birthday
dtype: object


**3. To set index values for a series**

Use the `index=` argument to supply custom labels. The data type of the Series remains numeric.

In [4]:
marks = [60, 89, 74, 86]
subject = ["Maths", "Science", "English" , "Social Science"]

pd.Series(marks, index = subject) 

Maths             60
Science           89
English           74
Social Science    86
dtype: int64

**4. To create a series from a dictionary**

When a dict is passed, the dict's keys become the index of the resulting Series. As the output below shows, the keys keep the order in which they appear in the dict; they are not sorted.

In [10]:
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

pd.Series(data)

Maths             60
Science           89
English           76
Social Science    86
dtype: int64

**5. A series with missing values**

If the index contains a label that is not a key in the dict, its value will be `NaN` (missing). Because `NaN` is a float, the Series is stored as `float64`.

In [11]:
subjects = ["Maths", "Science", "Art and Craft" , "Social Science"]
marks_series = pd.Series(data, index = subjects)

print(marks_series)

Maths             60.0
Science           89.0
Art and Craft      NaN
Social Science    86.0
dtype: float64
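
As a small sketch of the same behavior on made-up data (the names `scores` and `wanted` are assumptions for illustration), the missing label and the resulting dtype can be checked programmatically:

```python
import pandas as pd

# Illustrative marks; the names below are not from the notebook.
scores = {"Maths": 60, "Science": 89}
wanted = ["Maths", "Science", "Art and Craft"]

# Labels listed in `index` but absent from the dict are filled with NaN.
s = pd.Series(scores, index=wanted)

# NaN is a float, so the integer marks are upcast to a float dtype.
print(s.dtype)

# Count the missing entries.
print(s.isna().sum())  # 1
```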


<a id="manipulatingS"> </a>
### 1.2 Manipulating Series

**1. To check for null values using `.isnull`**

`False` indicates that the value is not null.

In [13]:
marks_series.isnull()

Maths             False
Science           False
Art and Craft      True
Social Science    False
dtype: bool

**2. To check for null values using `.notnull`**

`True` indicates that the value is not null.

In [14]:
marks_series.notnull()

Maths              True
Science            True
Art and Craft     False
Social Science     True
dtype: bool

**3. To find the subjects with marks greater than 75**

In [15]:
marks_series[marks_series > 75]

Science           89.0
Social Science    86.0
dtype: float64
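
The filter above works because the comparison itself produces a boolean Series. A minimal sketch, rebuilding a Series like `marks_series` under the illustrative name `marks`, shows the intermediate mask and how two conditions might be combined:

```python
import numpy as np
import pandas as pd

# Values mirror the marks used above; the variable name is illustrative.
marks = pd.Series(
    [60.0, 89.0, np.nan, 86.0],
    index=["Maths", "Science", "Art and Craft", "Social Science"],
)

# The comparison yields a boolean Series; NaN compares as False.
mask = marks > 75
print(mask.tolist())  # [False, True, False, True]

# Conditions combine with & (and) or | (or); each needs its own parentheses.
between = marks[(marks > 75) & (marks < 88)]
print(between)
```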

**4. Sorting a numeric Series and a string Series**

In [22]:
# create a pandas series
import numpy as np
values = pd.Series([23, 45, np.nan, 41, 23, 34, 55, np.nan, 34, 20])
print(values, "\n")

# ascending order
print(values.sort_values(ascending = True))

0    23.0
1    45.0
2     NaN
3    41.0
4    23.0
5    34.0
6    55.0
7     NaN
8    34.0
9    20.0
dtype: float64 

9    20.0
0    23.0
4    23.0
5    34.0
8    34.0
3    41.0
1    45.0
6    55.0
2     NaN
7     NaN
dtype: float64


In [23]:
# create a pandas series
string_values = pd.Series(["a", "j", "f", "t", "a"])
print(string_values, "\n")

# ascending order
print(string_values.sort_values(ascending = True))

0    a
1    j
2    f
3    t
4    a
dtype: object 

0    a
4    a
2    f
1    j
3    t
dtype: object
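
`sort_values` also accepts `ascending=False` and an `na_position` argument that controls where missing values are placed. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Illustrative values; not the Series used above.
values = pd.Series([23, np.nan, 41, 20])

# Descending order; missing values still go last by default.
desc = values.sort_values(ascending=False)
print(desc.index.tolist())  # [2, 0, 3, 1]

# na_position="first" moves missing values to the front instead.
na_first = values.sort_values(na_position="first")
print(na_first.index.tolist())  # [1, 3, 0, 2]
```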


<a id="dataframes"> </a>
## 2. Pandas DataFrames

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on). <br><br>
                        The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. In a data frame, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. 
<br><br>
                        While a DataFrame is physically two-dimensional, it can be used to represent higher-dimensional data in a tabular format using hierarchical indexing.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
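
The "dict of Series sharing the same index" view can be checked on a tiny frame. This is only a sketch; the column names and values are made up for illustration.

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame(
    {"city": ["Pune", "Oslo"], "temp_c": [31.5, 4.0]},
    index=["a", "b"],
)

# Selecting one column returns a Series...
col = df["temp_c"]
print(type(col).__name__)  # Series

# ...that carries the same row index as the frame.
print(col.index.equals(df.index))  # True

# Each column keeps its own dtype.
print(df.dtypes)
```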

<a id="creatingDF"> </a>
### 2.1 Creating DataFrames

**1. Creating a DataFrame from a dictionary**

**Note:** As with a Series, the resulting DataFrame is assigned an index automatically. The 'Marks' values are passed as a tuple rather than a list; both work.

**Note that every column of the DataFrame is a pandas Series.**

In [24]:
data = {'Subject': ['Maths', 'History', 'Science', 'English', 'Geography', 'Art'],
        'Marks': (45, 65, 78, 65, 80, 78),
        'CGPA': [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]}

df = pd.DataFrame(data)
print(df)

     Subject  Marks  CGPA
0      Maths     45   2.5
1    History     65   3.0
2    Science     78   3.5
3    English     65   2.0
4  Geography     80   4.0
5        Art     78   4.0


**2. To create a DataFrame from Series**

In [30]:
Subject = pd.Series(['Maths', 'History', 'Science', 'English', 'Geography', 'Art'])
Marks = pd.Series([45, 65, 78, 65, 80, 78])
CGPA = pd.Series([2.5, 3.0, 3.5, 2.0, 4.0, 4.0])

df = pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA'])

print(df)
print()

# Each Series becomes a row here. To get one column per Series instead, use `.T` (transpose).
print(df.T)

             0        1        2        3          4    5
Subject  Maths  History  Science  English  Geography  Art
Marks       45       65       78       65         80   78
CGPA       2.5      3.0      3.5      2.0        4.0  4.0

     Subject Marks CGPA
0      Maths    45  2.5
1    History    65  3.0
2    Science    78  3.5
3    English    65  2.0
4  Geography    80  4.0
5        Art    78  4.0


**3. To create a DataFrame from lists**

In [31]:
Subject = ['Maths', 'History', 'Science', 'English', 'Geography', 'Art']
Marks = [45, 65, 78, 65, 80, 78]
CGPA = [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]

pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T

     Subject Marks CGPA
0      Maths    45  2.5
1    History    65  3.0
2    Science    78  3.5
3    English    65  2.0
4  Geography    80  4.0
5        Art    78  4.0


**4. To read data from a CSV file**

In [36]:
data = pd.read_csv("datasets/example.csv")
type(data)

pandas.core.frame.DataFrame

In [37]:
# To print head of the data
data.head()

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21
2,54,78,1.5
3,26,65,1.21
4,68,50,1.32


**5. To obtain the dimension of the data**

In [38]:
data.shape

(23, 3)

**6. To know the data types of a data frame**

In [39]:
data.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
dtype: object

**7. To get summary information about the data**

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 684.0 bytes


<a id="manipulatingDF"> </a>
### 2.2 Manipulating DataFrames

#### Adding new columns and rows

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> CAUTION:<br>
                        1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
                        2. New columns cannot be created with the `data.BMI = ...` attribute syntax.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
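
The caution above can be illustrated with a small sketch. The frame `people` and its column names are hypothetical; the warning that pandas emits on attribute assignment is suppressed here only to keep the output short.

```python
import warnings
import pandas as pd

# Hypothetical frame used only for this illustration.
people = pd.DataFrame({"weight_kg": [60, 78], "height_m": [1.5, 1.8]})

# Attribute access reads an existing column whose name is a valid identifier.
print(people.weight_kg.tolist())  # [60, 78]

# Attribute assignment does NOT create a column; it only sets a plain
# Python attribute on the object (pandas warns about this).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    people.bmi = people["weight_kg"] / people["height_m"] ** 2
created_by_attribute = "bmi" in people.columns
print(created_by_attribute)  # False

# Bracket assignment creates the column.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
created_by_brackets = "bmi" in people.columns
print(created_by_brackets)  # True
```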

**1. Adding a new column to the data set**

In [42]:
data["BMI"] = data["Weight (in kg)"] / data["Height (in m)"]**2
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051
5,21,43,1.52,18.611496
6,10,32,1.65,11.753903
7,57,34,1.61,13.116778
8,75,23,1.24,14.958377
9,32,21,1.52,9.089335


In [43]:
data.shape

(23, 4)

**2. Adding a new row to the data set**

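A new row can be added by assigning to a new index label with `.loc`. Here the row is added to a copy made with `copy()`, so the original `data` is left unchanged.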

In [44]:
data_copy = data.copy()
data_copy.loc[23] = [45, 85, 1.8, 26.3]

data_copy

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335


A new row with index label 23 has been added to `data_copy`; the original `data` is unchanged.<br><br>
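
Another way to add a row, sketched here on made-up data, is to build a one-row DataFrame and join it with `pd.concat`. This is offered as an alternative under the assumption that the new row uses the same column names; it is not the method used in the notebook.

```python
import pandas as pd

# Illustrative frame; column names are simplified for this sketch.
people = pd.DataFrame({"Age": [45, 12], "Weight": [60, 43]})

# Build the new row as a one-row DataFrame with matching column names.
new_row = pd.DataFrame([{"Age": 30, "Weight": 70}])

# concat returns a new frame; ignore_index renumbers the rows 0..n-1.
extended = pd.concat([people, new_row], ignore_index=True)

print(extended.shape)  # (3, 2)
print(len(people))     # 2, the original is unchanged
```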

**3. Indexing a dataframe using `.iloc`**

`DataFrame.iloc[]` selects rows and columns by integer position (0, 1, 2, …, n-1), regardless of the index labels. It is useful when the index labels are not the default integers 0 to n-1, or when the labels are not known.

In [45]:
# Select the row at position 2 (the third row)
data.iloc[2]

Age               54.000000
Weight (in kg)    78.000000
Height (in m)      1.500000
BMI               34.666667
Name: 2, dtype: float64

In [47]:
# Select the rows at positions 4, 7 and 10
# We use two square brackets since we are passing a list of row positions.
print(data.iloc[[4,7,10]], "\n\n")

# Select the rows at positions 12 to 16 (the stop position, 17, is excluded)
print(data.iloc[12:17])

    Age  Weight (in kg)  Height (in m)        BMI
4    68              50           1.32  28.696051
7    57              34           1.61  13.116778
10   23              53           1.50  23.555556 


    Age  Weight (in kg)  Height (in m)        BMI
12   55              89           1.65  32.690542
13   23              45           1.75  14.693878
14   56              76           1.69  26.609713
15   67              78           1.85  22.790358
16   26              65           1.21  44.395875


In [52]:
# Select the last column
data.iloc[:,-1]

0     32.921811
1     29.369579
2     34.666667
3     44.395875
4     28.696051
5     18.611496
6     11.753903
7     13.116778
8     14.958377
9      9.089335
10    23.555556
11    20.983988
12    32.690542
13    14.693878
14    26.609713
15    22.790358
16    44.395875
17    25.909457
18    22.790358
19    44.395875
20    28.696051
21    26.609713
22    22.790358
Name: BMI, dtype: float64

In [53]:
# Select the first two columns
data.iloc[:,0:2]

Unnamed: 0,Age,Weight (in kg)
0,45,60
1,12,43
2,54,78
3,26,65
4,68,50
5,21,43
6,10,32
7,57,34
8,75,23
9,32,21
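
Positional selection with `.iloc` can be contrasted with label-based selection using `.loc`. The sketch below uses a small frame with hypothetical string labels (`p1`, `p2`, `p3`) so that positions and labels are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [45, 12, 54], "BMI": [32.9, 29.4, 34.7]},
    index=["p1", "p2", "p3"],  # hypothetical string labels
)

# .iloc is positional; .loc uses the row and column labels.
first_age = df.iloc[0, 0]        # row position 0, column position 0
same_age = df.loc["p1", "Age"]   # row label "p1", column label "Age"
print(first_age == same_age)     # True

# Positional slices exclude the stop; label slices include it.
by_position = df.iloc[0:2]       # rows at positions 0 and 1
by_label = df.loc["p1":"p3"]     # rows labelled "p1" through "p3"
print(len(by_position), len(by_label))  # 2 3
```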


**4. Selecting columns by specifying column names**

In [54]:
# Select the columns 'Age' and 'BMI'
data[["Age","BMI"]]

Unnamed: 0,Age,BMI
0,45,32.921811
1,12,29.369579
2,54,34.666667
3,26,44.395875
4,68,28.696051
5,21,18.611496
6,10,11.753903
7,57,13.116778
8,75,14.958377
9,32,9.089335


**5. Sort the data frame on the basis of values in a column**

Each column of a pandas DataFrame is a pandas Series. `DataFrame.sort_values()` works like `Series.sort_values()`, except that the column to sort by must be named.

In [56]:
# sort the data frame on basis of 'Age' values
# by default the values will get sorted in ascending order
data.sort_values('Age')

#Note: 'ascending = False' will sort the data frame in descending order

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
6,10,32,1.65,11.753903
1,12,43,1.21,29.369579
5,21,43,1.52,18.611496
13,23,45,1.75,14.693878
10,23,53,1.5,23.555556
19,26,65,1.21,44.395875
3,26,65,1.21,44.395875
16,26,65,1.21,44.395875
9,32,21,1.52,9.089335
11,34,65,1.76,20.983988


**6. Rank the dataframe**

The value BMI = 44.395875 occurs three times, so `method='min'` assigns the minimum rank (1) to all three rows. The second-largest BMI then receives rank 4, and so on; ranks 2 and 3 are not used.

In [58]:
# Rank the rows of 'data' in descending order of 'BMI' (the largest BMI gets rank 1).
# method='min' gives every tied value the lowest rank in its group.
data['BMI_ranked'] = data['BMI'].rank(ascending=False, method='min')
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI,BMI_ranked
0,45,60,1.35,32.921811,5.0
1,12,43,1.21,29.369579,7.0
2,54,78,1.5,34.666667,4.0
3,26,65,1.21,44.395875,1.0
4,68,50,1.32,28.696051,8.0
5,21,43,1.52,18.611496,18.0
6,10,32,1.65,11.753903,22.0
7,57,34,1.61,13.116778,21.0
8,75,23,1.24,14.958377,19.0
9,32,21,1.52,9.089335,23.0


In [60]:
# The 'dense' method ranks in ascending order by default: the smallest BMI
# (9.089335) gets rank 1, the next distinct value gets rank 2, and so on.
# Tied values share a rank, and no rank is skipped.
data['BMI_densed_rank'] = data['BMI'].rank(method = 'dense')
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI,BMI_ranked,BMI_densed_rank
0,45,60,1.35,32.921811,5.0,15.0
1,12,43,1.21,29.369579,7.0,13.0
2,54,78,1.5,34.666667,4.0,16.0
3,26,65,1.21,44.395875,1.0,17.0
4,68,50,1.32,28.696051,8.0,12.0
5,21,43,1.52,18.611496,18.0,6.0
6,10,32,1.65,11.753903,22.0,2.0
7,57,34,1.61,13.116778,21.0,3.0
8,75,23,1.24,14.958377,19.0,5.0
9,32,21,1.52,9.089335,23.0,1.0
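
The difference between the ranking methods is easier to see on a tiny Series with one tie. The values below are made up for illustration:

```python
import pandas as pd

# Illustrative values with one tie (20.0 appears twice).
bmi = pd.Series([10.0, 20.0, 20.0, 30.0])

# 'min': tied values share the lowest rank of the group; the next rank is skipped.
print(bmi.rank(method="min").tolist())    # [1.0, 2.0, 2.0, 4.0]

# 'dense': tied values share a rank and no rank is skipped afterwards.
print(bmi.rank(method="dense").tolist())  # [1.0, 2.0, 2.0, 3.0]

# 'average' (the default): tied values get the mean of their ranks.
print(bmi.rank().tolist())                # [1.0, 2.5, 2.5, 4.0]

# ascending=False ranks the largest value first.
print(bmi.rank(ascending=False, method="min").tolist())  # [4.0, 2.0, 2.0, 1.0]
```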


**7. To check for missing values**

The method `.isnull()` checks whether each value is missing. `sum()` then counts the `True` values in each column, so the final output gives the number of missing values per column.

Here, every count is 0, so no column has missing values.

In [61]:
data.isnull().sum()

Age                0
Weight (in kg)     0
Height (in m)      0
BMI                0
BMI_ranked         0
BMI_densed_rank    0
dtype: int64

<a id="pandas"> </a>
## 3. Pandas Basics

In [63]:
# NumPy and pandas are typically imported together.
# np and pd are conventional aliases.
import numpy as np
import pandas as pd

In [66]:
# Read in data from a .csv file.
dataframe = pd.read_csv('datasets/train.csv')

dataframe.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [68]:
## Calculations

# Calculate the mean of the Age column.
print("Mean: ", dataframe['Age'].mean())

# Calculate the maximum value contained in the Age column.
print("Max: ", dataframe['Age'].max())

# Calculate the minimum value contained in the Age column.
print("Min: ", dataframe['Age'].min())

# Calculate the standard deviation of the values in the Age column.
print("STD: ", dataframe['Age'].std())

Mean:  29.69911764705882
Max:  80.0
Min:  0.42
STD:  14.526497332334042


In [69]:
# Return the number of rows that share the same value in the Pclass column.
dataframe['Pclass'].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [70]:
# The describe() method returns summary statistics of the dataframe.
dataframe.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [71]:
# Filter the data to return only rows where value in Age column is greater than 60
# and value in Pclass column equals 3.
dataframe[(dataframe['Age'] > 60) & (dataframe['Pclass'] == 3)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [72]:
# Create a new column called 2023_Fare that contains the inflation-adjusted
# fare of each ticket in 2023 pounds.
dataframe['2023_Fare'] = dataframe['Fare'] * 146.14
dataframe

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2023_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1059.515000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1158.159500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,7760.034000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1176.427000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1899.820000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,4384.200000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,3426.983000
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,4384.200000


In [73]:
# Use iloc to access data by integer position.
# Select the value at row position 1, column position 3.
dataframe.iloc[1, 3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [74]:
# Group passengers by Sex and Pclass and calculate the number of fares and the
# total fare paid for each group, then derive the mean fare paid per group.
fare = dataframe.groupby(['Sex', 'Pclass']).agg({'Fare': ['count', 'sum']})
fare['fare_avg'] = fare['Fare']['sum'] / fare['Fare']['count']
fare

                Fare               fare_avg
               count        sum
Sex    Pclass
female 1          94   9975.825  106.125798
       2          76  1669.7292   21.970121
       3         144  2321.1086    16.11881
male   1         122  8201.5875   67.226127
       2         108  2132.1125   19.741782
       3         347  4393.5865   12.661633
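
The same kind of summary can be written with named aggregation, which gives flat column names and computes the mean directly. This is a sketch on a made-up four-row frame called `tickets`, not the Titanic data itself.

```python
import pandas as pd

# Made-up data standing in for a few passenger records.
tickets = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 3],
    "Fare": [80.0, 100.0, 8.0, 12.0],
})

# Named aggregation: output_column=(input_column, function).
summary = tickets.groupby(["Sex", "Pclass"]).agg(
    n=("Fare", "count"),
    fare_sum=("Fare", "sum"),
    fare_avg=("Fare", "mean"),
)
print(summary)
```

Rows of `summary` are indexed by the `(Sex, Pclass)` pair, so a single group can be looked up with a tuple, for example `summary.loc[("female", 1)]`.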


In [75]:
# Create a copy of dataframe named 'titanic'.
titanic = dataframe.copy()

# The head() method outputs the first 5 rows of dataframe.
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2023_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1059.515
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1158.1595
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,7760.034
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1176.427
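
Plain assignment and `.copy()` behave differently, which matters once columns are added. A minimal sketch with made-up data (the names `original`, `alias`, and `independent` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({"Fare": [7.25, 71.28]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# A column added through the alias also appears on the original...
alias["Fare_x2"] = alias["Fare"] * 2
print("Fare_x2" in original.columns)     # True

# ...but not on the copy.
print("Fare_x2" in independent.columns)  # False
```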


In [76]:
# The columns attribute returns an Index object containing the dataframe's columns.
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', '2023_Fare'],
      dtype='object')

In [77]:
# The shape attribute returns the shape of the dataframe (rows, columns).
titanic.shape

(891, 13)

In [78]:
# The info() method returns summary information about the dataframe.
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  2023_Fare    891 non-null    float64
dtypes: float64(3), int64(5), object(5)
memory usage: 90.6+ KB


In [79]:
# Use loc to select the rows labelled 0 through 3 (inclusive) and only the Name column.
titanic.loc[0:3, ['Name']]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"


In [80]:
# Create a new column in the dataframe containing the value in the Age column + 100.
titanic['Age_plus_100'] = titanic['Age'] + 100
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2023_Fare,Age_plus_100
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1059.515,122.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462,138.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1158.1595,126.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,7760.034,135.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1176.427,135.0


<a id="masking"> </a>
## 4. Boolean Masking


In [81]:
# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                      25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }
# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planets

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83
6,Uranus,25362,27
7,Neptune,24622,14
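
A boolean mask is a Series of `True`/`False` values that selects the rows to keep. The following sketch applies one to a shortened version of the planetary data; the particular conditions are chosen only for illustration.

```python
import pandas as pd

# Shortened version of the planetary data, for illustration.
planets = pd.DataFrame({
    "planet": ["Mercury", "Venus", "Earth", "Mars"],
    "moons": [0, 0, 1, 2],
})

# A comparison on a column yields a boolean Series (the mask).
mask = planets["moons"] < 1
print(mask.tolist())  # [True, True, False, False]

# Indexing with the mask keeps only the rows where it is True.
no_moons = planets.loc[mask, "planet"].tolist()
print(no_moons)  # ['Mercury', 'Venus']

# Masks combine with & (and), | (or), and ~ (not).
some_moons = planets[~mask & (planets["moons"] < 2)]
print(some_moons["planet"].tolist())  # ['Earth']
```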


<a id="grouping"> </a>
## 5. Grouping and Aggregation
