# **Pandas DataFrame Exploration**

This notebook explores Pandas Series and DataFrames in Python. It demonstrates how to create, manipulate, and analyze data with Pandas; the main topics are listed below.

# **Key Topics Covered**

1. **Series Creation and Operations:**
    - Creating Pandas Series from lists and dictionaries.
    - Basic arithmetic operations on Series.
    - Accessing elements in Series using index and slicing.
    - Calculating the mean of a Series.


2. **DataFrame Creation and Manipulation:**
    - Creating DataFrames from dictionaries with mixed data types.
    - Accessing columns and rows in DataFrames.
    - Adding and deleting columns in DataFrames.
    - Modifying index and accessing data by index or label.

3. **DataFrame Slicing:**
    - Slicing DataFrames using `.loc` (label-based indexing) and `.iloc` (integer-based indexing).
    - Filtering DataFrames based on conditions.
    - Selecting specific rows and columns using slicing.

4. **Loading and Exploring Datasets:**
    - Loading a CSV file (Titanic dataset) into a Pandas DataFrame using `pd.read_csv`.
    - Displaying basic information about the DataFrame (head, tail, shape, columns, info).
    - Generating descriptive statistics of the numerical columns using `.describe()`.

5. **Data Analysis with NumPy:**
    - Applying NumPy functions on DataFrame columns (which are Pandas Series).
    - Calculating the mean of numeric columns in a DataFrame.

# Code Examples

The notebook provides code examples for each of the topics listed above. They demonstrate operations on Series and DataFrames, including:


- Creating Series from different data structures.
- Performing arithmetic and statistical operations.
- Slicing and filtering data.
- Loading and summarizing datasets.

## Titanic Dataset

The Titanic dataset is used to illustrate data loading and analysis. Descriptive statistics are shown with `.describe()`, and basic information about the dataset is given with `.info()`.

## Requirements

This notebook relies on the following libraries:

- `numpy`
- `pandas`

To run this notebook, ensure these libraries are installed in your Python environment. You can install them using pip, for example with `pip install numpy pandas`.


In [1]:
import numpy as np
import pandas as pd

# **Series**

In [3]:
series1 = pd.Series([1,2,3,5,4], index = ['a', 'b', 'c', 'd', 'e'])
series1

Unnamed: 0,0
a,1
b,2
c,3
d,5
e,4


In [5]:
dict1 = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
series2 = pd.Series(dict1)
series2

Unnamed: 0,0
a,1
b,2
c,3
d,4
e,5


In [6]:
series1 + series1

Unnamed: 0,0
a,2
b,4
c,6
d,10
e,8


In [7]:
series1*series1

Unnamed: 0,0
a,1
b,4
c,9
d,25
e,16
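
The two cells above combine a Series with itself, so the labels trivially match. As a minimal sketch of what happens with two different Series, the example below rebuilds `series1` and `series2` from the earlier cells and adds a hypothetical `partial` Series (not part of the notebook) that shares only one label with `series1`. Pandas pairs values by index label rather than by position.

```python
import pandas as pd

# Same data as series1 and series2 in the cells above.
series1 = pd.Series([1, 2, 3, 5, 4], index=['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

# Values are paired by label: 'd' gives 5 + 4 and 'e' gives 4 + 5.
aligned_sum = series1 + series2
print(aligned_sum)

# Hypothetical Series that shares only the label 'a' with series1.
partial = pd.Series([10, 20], index=['a', 'z'])

# Labels present on only one side produce NaN in the result.
with_gaps = series1 + partial
print(with_gaps)
```

Only label `a` has a value in both operands of the second sum, so every other label in the result is NaN.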


In [10]:
series1["a"]

np.int64(1)

In [12]:
series1[0:3]

Unnamed: 0,0
a,1
b,2
c,3


In [14]:
np.mean(series1)  # numpy array functions generally work on series

np.float64(3.0)

# **DataFrames**

In [18]:
# A DataFrame can be created from a dictionary; the keys become column names.
# The values can be different data structures: lists, NumPy arrays, or scalars.
# Scalars such as 'M' and True are repeated for every row.

dict2 = {'name': ['John Wick', 'Joe Biden', 'Trump'],
         'gender': 'M',
         'age': np.array([35, 90, 16]),
         'marriage': True}

df1 = pd.DataFrame(dict2)

In [17]:
df1

Unnamed: 0,name,gender,age,marriage
0,John Wick,M,35,True
1,Joe Biden,M,90,True
2,Trump,M,16,True
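
The scalar values `'M'` and `True` fill every row because the list and the array fix the number of rows at three. The sketch below rebuilds the same frame and then shows, as an illustration outside the original notebook, that a dictionary containing only scalars needs an explicit index; the labels `'x'` and `'y'` are hypothetical.

```python
import numpy as np
import pandas as pd

# Same dictionary as dict2 above: two list-like columns, two scalar columns.
dict2 = {'name': ['John Wick', 'Joe Biden', 'Trump'],
         'gender': 'M',
         'age': np.array([35, 90, 16]),
         'marriage': True}
df1 = pd.DataFrame(dict2)

# Scalars are repeated to match the length of the list-like columns.
gender_values = df1['gender'].tolist()
marriage_dtype = df1['marriage'].dtype
print(df1.dtypes)

# With only scalar values pandas cannot infer a length and raises ValueError.
try:
    pd.DataFrame({'gender': 'M', 'marriage': True})
    all_scalars_ok = True
except ValueError:
    all_scalars_ok = False

# Supplying an index (hypothetical labels) fixes the number of rows.
scalar_df = pd.DataFrame({'gender': 'M', 'marriage': True}, index=['x', 'y'])
```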


In [19]:
df1['age']

Unnamed: 0,age
0,35
1,90
2,16


In [20]:
df1.name

Unnamed: 0,name
0,John Wick
1,Joe Biden
2,Trump


In [21]:
df1['country'] = 'US'
df1

Unnamed: 0,name,gender,age,marriage,country
0,John Wick,M,35,True,US
1,Joe Biden,M,90,True,US
2,Trump,M,16,True,US


In [23]:
del df1['country']
df1

Unnamed: 0,name,gender,age,marriage
0,John Wick,M,35,True
1,Joe Biden,M,90,True
2,Trump,M,16,True


In [24]:
df1['country'] = 'US'
df1

Unnamed: 0,name,gender,age,marriage,country
0,John Wick,M,35,True,US
1,Joe Biden,M,90,True,US
2,Trump,M,16,True,US
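
Assigning to `df1['country']` and using `del` both modify the frame in place. As a hedged alternative, the sketch below uses `assign` and `drop`, which return new frames and leave the original untouched. The small `people` frame is a hypothetical stand-in for `df1`.

```python
import pandas as pd

# Hypothetical stand-in for df1 with two of its columns.
people = pd.DataFrame({'name': ['John Wick', 'Joe Biden', 'Trump'],
                       'age': [35, 90, 16]})

# assign returns a copy with the extra column; people itself is unchanged.
with_country = people.assign(country='US')

# drop(columns=...) returns a copy without the named column.
without_country = with_country.drop(columns='country')

print(list(people.columns))
print(list(with_country.columns))
print(list(without_country.columns))
```

This style is convenient when the original frame is still needed later in the notebook.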


# **Slicing Dataframes**

In [27]:
# let's grab data from the dataframe based on particular rows and columns.

df1.loc[df1['name'] == 'Trump']

Unnamed: 0,name,gender,age,marriage,country
2,Trump,M,16,True,US


In [31]:
df1[df1['name'] == 'Trump']

Unnamed: 0,name,gender,age,marriage,country
2,Trump,M,16,True,US


In [35]:
df1.loc[df1['name'] == 'Trump', 'age':'country']

Unnamed: 0,age,marriage,country
2,16,True,US


In [36]:
df1.loc[df1['name'] == 'Trump', 'gender':'country']

Unnamed: 0,gender,age,marriage,country
2,M,16,True,US


In [37]:
df1 = pd.DataFrame(dict2, index = dict2['name'])
df1

Unnamed: 0,name,gender,age,marriage
John Wick,John Wick,M,35,True
Joe Biden,Joe Biden,M,90,True
Trump,Trump,M,16,True


In [40]:
df1.loc['Trump']

Unnamed: 0,Trump
name,Trump
gender,M
age,16
marriage,True


In [43]:
df1.loc['John Wick':'Trump', 'age':'marriage']

Unnamed: 0,age,marriage
John Wick,35,True
Joe Biden,90,True
Trump,16,True


In [44]:
# Access data by integer position with .iloc

df1.iloc[0]

Unnamed: 0,John Wick
name,John Wick
gender,M
age,35
marriage,True


In [46]:
df1.iloc[0:2, 1:3]  # first two rows, second and third columns (end positions are excluded)

Unnamed: 0,gender,age
John Wick,M,35
Joe Biden,M,90
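
The difference between the two indexers is easy to miss: `.loc` slices include the end label, while `.iloc` slices exclude the end position. A minimal sketch, assuming the same data and name-based index as the cells above, selects the same block both ways.

```python
import numpy as np
import pandas as pd

dict2 = {'name': ['John Wick', 'Joe Biden', 'Trump'],
         'gender': 'M',
         'age': np.array([35, 90, 16]),
         'marriage': True}
df1 = pd.DataFrame(dict2, index=dict2['name'])

# Label-based: both end labels are part of the result.
by_label = df1.loc['John Wick':'Joe Biden', 'gender':'age']

# Position-based: rows 0 and 1, columns 1 and 2 (the ends 2 and 3 are excluded).
by_position = df1.iloc[0:2, 1:3]

print(by_label.equals(by_position))
```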


In [47]:
df1[df1['age'] > 30]

Unnamed: 0,name,gender,age,marriage
John Wick,John Wick,M,35,True
Joe Biden,Joe Biden,M,90,True


# **Loading Dataset**

In [49]:
df1 = pd.read_csv('/content/Titanic-Dataset.csv')
df1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
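
The path `/content/Titanic-Dataset.csv` is specific to the environment where the notebook was run. As a self-contained sketch that works without the file, the example below passes three of the rows shown above to `pd.read_csv` through an in-memory buffer; the `csv_text` variable is an illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Three rows copied from the head() output above, as raw CSV text.
csv_text = (
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
    '5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S\n'
)

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample[['Name', 'Age', 'Cabin']])
```

Quoted fields keep the comma inside each name, and the empty `Cabin` fields are read as missing values.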


In [50]:
df1.shape

(891, 12)

In [53]:
df1.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [54]:
df1.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [57]:
df1.describe()  # summary statistics for the numeric columns in the dataset

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Many NumPy array functions also work on DataFrame columns, because each column is a Pandas Series.

In [61]:
# Mean of every numeric column (non-numeric columns are excluded first)

numeric_df = df1.select_dtypes(include = np.number)
np.mean(numeric_df, axis = 0)

Unnamed: 0,0
PassengerId,446.0
Survived,0.383838
Pclass,2.308642
Age,29.699118
SibSp,0.523008
Parch,0.381594
Fare,32.204208
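
The `Age` mean above is computed even though `Age` contains missing values, because Pandas skips NaN by default. The sketch below uses a small hypothetical frame `toy` (not the Titanic data) to show the equivalent DataFrame method and how a plain NumPy array behaves differently.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one non-numeric column.
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                    'Fare': [7.25, 71.2833, 7.925],
                    'Sex': ['male', 'female', 'female']})

# DataFrame method: numeric columns only, NaN skipped by default.
column_means = toy.mean(numeric_only=True)
print(column_means)

# On a raw NumPy array, np.mean propagates NaN ...
raw_mean = np.mean(toy['Age'].to_numpy())

# ... while np.nanmean ignores it, matching the Pandas result.
nan_aware = np.nanmean(toy['Age'].to_numpy())
print(raw_mean, nan_aware)
```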


In [62]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
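
The non-null counts above imply missing values: 891 - 714 = 177 in `Age`, 891 - 204 = 687 in `Cabin`, and 891 - 889 = 2 in `Embarked`. A minimal sketch of counting them directly, using a small hypothetical frame `toy` in place of the Titanic data:

```python
import pandas as pd

# Hypothetical frame with gaps in two of its three columns.
toy = pd.DataFrame({'Age': [22.0, None, 26.0, None],
                    'Cabin': [None, 'C85', None, None],
                    'Fare': [7.25, 71.2833, 7.925, 8.05]})

# isna() marks missing cells; summing the booleans counts them per column.
missing = toy.isna().sum()

# count() reports the non-null entries, like the column in info().
non_null = toy.count()

print(missing)
print(non_null)
```

For each column, the two counts add up to the number of rows.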
