In [1]:
## !pip install pandas

# Pandas Basics

> Pandas is a widely-used Python library for data manipulation and analysis, centred around tabular data structures. It offers powerful and flexible data structures, such as `DataFrame`s and `Series`, to enable efficient data wrangling, cleaning, and analysis.

### Key Features of Pandas

Pandas has a wide range of features for viewing and manipulating tabular data:

- **Data Exploration and Analysis**: It provides functions for calculating descriptive statistics, summarising data, and identifying trends or patterns
- **Graphing and Visualisation**: It has integrations with libraries like Matplotlib for creating informative plots and charts directly from `DataFrame`s
- **Data Cleaning and Preparation**: Tools for handling missing data, detecting outliers, and preparing datasets for analysis or machine learning models
- **Data Transformation**: Capabilities for reshaping, pivoting, and transforming datasets to suit specific analysis needs
- **File Compatibility and I/O**: Efficiently reads and writes to a variety of file formats, such as CSV, Excel, and SQL databases, facilitating easy data import and export

## Practical Applications

Pandas is used in various applications within data science and related fields:

- **Data Cleaning and Preparation**: It streamlines the process of cleaning raw data, handling missing values, and preparing datasets for analysis
- **Exploratory Data Analysis (EDA)**: Offers tools for deep-diving into datasets, uncovering insights, and informing subsequent analysis or modelling decisions
- **Data Transformation and Aggregation**: Essential for reshaping data, creating summary tables, and performing group-wise operations for data analysis
- **Data Visualisation**: Aids in creating meaningful visual representations of data, crucial for reporting and storytelling in data science

Pandas' functionality is not only robust but also relatively user-friendly, making it a go-to tool for data professionals seeking to perform efficient and effective data analysis.

## Installation

Pandas can be installed directly via `pip` using the `pip install pandas` command. Alternatively, you can create a Conda environment and then run `conda install pandas` inside the activated environment.

You can check that Pandas is installed correctly using the following command in your CLI:

`python -c "import pandas as pd; print(pd.__version__)"`

This should return the version of Pandas that you have installed.

## Pandas Data Structures

> Pandas introduces two primary data structures: `Series` and `DataFrame`. A `Series` is a one-dimensional array-like object, akin to a column in a spreadsheet, while a `DataFrame` is a two-dimensional table with rows and columns, similar to a whole spreadsheet. Both are built on top of `ndarray` objects from the NumPy library, which are multidimensional arrays intended for numerical data. Pandas extends their capabilities with more functionality and support for a wider variety of data types. A `Series` is ideal for representing one-dimensional data, whereas a `DataFrame` is better suited to tabular data with several columns.

Each of these data structures has a wide range of powerful built-in methods that allow you to perform all sorts of data tasks quickly and efficiently.
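As a small illustration of the relationship with NumPy, the sketch below builds a `Series` and a `DataFrame` and extracts the underlying array with `.to_numpy()`. The variable names and values are made up for this example, and the exact dtypes printed may vary between Pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical values, purely for illustration
ages = pd.Series([25, 40, 28], name="age")
people = pd.DataFrame({"name": ["Gabriel", "Angela", "Daniel"], "age": [25, 40, 28]})

# .to_numpy() exposes the values of a Series as a NumPy array
ages_array = ages.to_numpy()
print(type(ages_array))

# Each DataFrame column is itself a Series, and columns may have different dtypes
print(type(people["age"]))
print(people.dtypes)
```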


###  Pandas `DataFrame`


> A `DataFrame` (often abbreviated to `df`) is like a table or an Excel spreadsheet. It is a 2-dimensional structure used to store and work with data. It has rows and columns, where each column can hold a different type of data, such as numbers or text. It's great for analysing and organising tabular data. Like other Python data structures, a `DataFrame` is an object with a range of associated methods, and it is these methods that make the `DataFrame` a powerful tool for analysis.

Let's take a look at an example `DataFrame`. We will define a simple Python dictionary and then convert it into a `DataFrame` using the `pd.DataFrame()` constructor. You can then view the first few rows using the `DataFrame`'s `.head()` method, optionally passing a number of rows as an argument:











In [2]:
import pandas as pd


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/adebayoolaonipekun/anaconda3/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/Users/adebayoolaonipekun/anaconda3/lib/python3.11/site-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/Users/adebayoolaonipekun/anaconda3/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 701, in start
    se

AttributeError: _ARRAY_API not found


In [3]:
data =  {
    'name': ['Gabriel', 'Angela', 'Daniel'],
    'age': [25, 40, 28],
    'location': ['London', 'Gboko', 'Yaounde']
}

In [4]:
print(data)

{'name': ['Gabriel', 'Angela', 'Daniel'], 'age': [25, 40, 28], 'location': ['London', 'Gboko', 'Yaounde']}


In [5]:
example_df = pd.DataFrame(data)

In [6]:
example_df

Unnamed: 0,name,age,location
0,Gabriel,25,London
1,Angela,40,Gboko
2,Daniel,28,Yaounde


We can see that the `pd.DataFrame` constructor has taken our dictionary and produced a table with the keys as column names and the value lists as the data for each column. It has also added an extra column on the left. This is known as the `index` and is used to identify a specific row.
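
The following sketch rebuilds the same `DataFrame` and inspects it with `.head()`, passing a row count, along with the `columns`, `index` and `shape` attributes. It is only meant to show where the column names and the index end up.

```python
import pandas as pd

data = {
    "name": ["Gabriel", "Angela", "Daniel"],
    "age": [25, 40, 28],
    "location": ["London", "Gboko", "Yaounde"],
}
example_df = pd.DataFrame(data)

# Passing a number to .head() limits how many rows are returned
first_two = example_df.head(2)
print(first_two)

print(list(example_df.columns))  # ['name', 'age', 'location']
print(list(example_df.index))    # [0, 1, 2]
print(example_df.shape)          # (3, 3)
```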
### Pandas `Series`
> A `Series` is like a single column of a `DataFrame`, a 1-dimensional array that can hold data of any type. 

 We can create a `Series` from a list as follows:

In [7]:
my_list = [1, 2, 3]

In [8]:
example_series = pd.Series(my_list)

In [9]:
example_series

0    1
1    2
2    3
dtype: int64

In [10]:
my_list2 = ['Ade', 'George', 'Charles']

In [11]:
example_series2 = pd.Series(my_list2)

In [12]:
example_series2

0        Ade
1     George
2    Charles
dtype: object

## Indexing and Slicing Pandas Data Structures

> In Pandas, indexing and slicing enable precise data selection and manipulation within `Series` and `DataFrame` objects. This functionality is akin to accessing elements in Python lists or arrays, but with enhanced capabilities. Indexing allows for selecting specific rows or columns using labels or positions, while slicing facilitates retrieving subsets of data.

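With the default integer index, a `Series` can be indexed much like a Python list. A single `DataFrame` column selected by name is returned as a `Series`, while selecting with a list of column names returns a `DataFrame`: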



In [13]:
example_series2[0]

'Ade'

In [14]:
col_1 = example_df['name']
print(col_1)

0    Gabriel
1     Angela
2     Daniel
Name: name, dtype: object


In [15]:
print(type(col_1))

<class 'pandas.core.series.Series'>


In [16]:
col_2 = example_df[['name']]
print(col_2)

      name
0  Gabriel
1   Angela
2   Daniel


In [17]:
print(type(col_2))

<class 'pandas.core.frame.DataFrame'>


### The `.loc` Attribute
We can select a specific row of a `DataFrame` using the `.loc` attribute, which selects rows by their index labels. Note the syntax: the row label goes in square brackets `[]`, not parentheses `()`. Passing a list of labels returns the matching rows as a `DataFrame`, in the order given.

In [18]:
example_df.head()

Unnamed: 0,name,age,location
0,Gabriel,25,London
1,Angela,40,Gboko
2,Daniel,28,Yaounde


In [19]:
example_df.loc[0]

name        Gabriel
age              25
location     London
Name: 0, dtype: object

In [20]:
print(type(example_df.loc[0]))

<class 'pandas.core.series.Series'>


In [21]:
example_df.loc[[0, 2]]

Unnamed: 0,name,age,location
0,Gabriel,25,London
2,Daniel,28,Yaounde


In [22]:
example_df.loc[[2, 1]]

Unnamed: 0,name,age,location
2,Daniel,28,Yaounde
1,Angela,40,Gboko


### The `iloc` Attribute
Currently our index labels are the integers `0`, `1`, `2`, ..., which match Python's zero-based positions. However, we can also use another list of values (ideally unique) as our index:

In [23]:
example_df

Unnamed: 0,name,age,location
0,Gabriel,25,London
1,Angela,40,Gboko
2,Daniel,28,Yaounde


In [24]:
example_df.index = ['i', 'ii', 'iii']

In [25]:
example_df

Unnamed: 0,name,age,location
i,Gabriel,25,London
ii,Angela,40,Gboko
iii,Daniel,28,Yaounde


The `loc` attribute is for **label-based indexing**. Using `loc` will work as before with the new index values:

In [26]:
example_df.loc['i']

name        Gabriel
age              25
location     London
Name: i, dtype: object

But we now have a mismatch between Python's zero-based positions and the labels in our index. Under these circumstances we can use the `iloc` attribute for **position-based indexing** instead:

In [27]:
example_df.iloc[0]

name        Gabriel
age              25
location     London
Name: i, dtype: object
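
The same distinction applies to a `Series`. The sketch below gives a `Series` a custom index and then looks values up by label and by position. The labels `'a'`, `'b'` and `'c'` are hypothetical choices for this example.

```python
import pandas as pd

# Hypothetical labels 'a', 'b', 'c' replace the default 0, 1, 2 index
names = pd.Series(["Ade", "George", "Charles"], index=["a", "b", "c"])

by_label = names.loc["b"]      # label-based lookup
by_position = names.iloc[0]    # position-based lookup
last_item = names.iloc[-1]     # negative positions count from the end
first_two = names.iloc[0:2]    # a position slice excludes the stop position

print(by_label, by_position, last_item)
print(first_two)
```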

### Slicing Rows and Columns Together

As well as selecting individual rows and columns, we can index or slice by both at once:


#### Using `loc`:

In [28]:
example_df

Unnamed: 0,name,age,location
i,Gabriel,25,London
ii,Angela,40,Gboko
iii,Daniel,28,Yaounde


In [29]:
example_df.loc['i', ['name', 'age']]

name    Gabriel
age          25
Name: i, dtype: object

#### Using `iloc`:

In [30]:
example_df

Unnamed: 0,name,age,location
i,Gabriel,25,London
ii,Angela,40,Gboko
iii,Daniel,28,Yaounde


In [31]:
example_df.iloc[0, [0, 1]]

name    Gabriel
age          25
Name: i, dtype: object

In [32]:
example_df.iloc[2, [2, 0]]

location    Yaounde
name         Daniel
Name: iii, dtype: object
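
The examples above pass a single row with a list of columns. Ranges can be selected with slices as well, as in the sketch below, which rebuilds the same `DataFrame`. One difference worth noting is that label slices with `loc` include the end label, whereas position slices with `iloc` exclude the stop position, as in ordinary Python slicing.

```python
import pandas as pd

example_df = pd.DataFrame(
    {
        "name": ["Gabriel", "Angela", "Daniel"],
        "age": [25, 40, 28],
        "location": ["London", "Gboko", "Yaounde"],
    },
    index=["i", "ii", "iii"],
)

# Label-based slice: the end labels 'ii' and 'age' are included
by_label = example_df.loc["i":"ii", "name":"age"]

# Position-based slice: the stop position 2 is excluded
by_position = example_df.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```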

## Resetting the Index

> Resetting the index of a Pandas `DataFrame` is often used after performing data manipulations like slicing or filtering, which can leave the `DataFrame` with an index that is non-sequential or not aligned with the data's current state. 

### Why Reset the Index?

- **Non-Sequential Indices:** After slicing or filtering a `DataFrame`, the resulting index might be non-sequential or non-contiguous, which can be confusing and may cause issues with data alignment or further data manipulation
- **Alignment and Consistency:** Resetting the index ensures that the `DataFrame` maintains a consistent structure, with a sequential numeric index starting from 0
- **Ease of Merging and Joining:** A standard, sequential index is often easier to work with when performing database-style join or merge operations


In [33]:
data = {
    'name': ['George', 'Akindoyin', 'Esther','Michael'],
    'age': [40, 30, 60, 80]
}
df = pd.DataFrame(data)
df

Unnamed: 0,name,age
0,George,40
1,Akindoyin,30
2,Esther,60
3,Michael,80


In [34]:
# where age is greater than 30
slice_df = df[df['age'] > 30]
slice_df

Unnamed: 0,name,age
0,George,40
2,Esther,60
3,Michael,80


In [35]:
reset_df = slice_df.reset_index(drop=True)
reset_df

Unnamed: 0,name,age
0,George,40
1,Esther,60
2,Michael,80
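
The `drop=True` argument used above discards the old index labels. As a sketch of the difference, the code below repeats the same filter and calls `reset_index` both ways. Without `drop=True`, the old labels are kept in a new column named `index`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["George", "Akindoyin", "Esther", "Michael"],
        "age": [40, 30, 60, 80],
    }
)
slice_df = df[df["age"] > 30]  # the index is now 0, 2, 3

kept = slice_df.reset_index()              # old labels move into an 'index' column
dropped = slice_df.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```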


## Importing Data into Pandas 

> Pandas provides functions for importing data in various formats and loading it into a `DataFrame`. There are functions to read data from sources such as `CSV`, Excel and `JSON` files, and from SQL databases.

### Creating a `DataFrame` from Python Objects

As we saw earlier in the lesson, a `DataFrame` can be created from a dictionary of lists, where the keys are the column headings and the values are lists of column values.


In [36]:
data = {
    'name': ['George', 'Akindoyin', 'Esther','Michael'],
    'age': [40, 30, 60, 80]
}
df = pd.DataFrame(data)
df

Unnamed: 0,name,age
0,George,40
1,Akindoyin,30
2,Esther,60
3,Michael,80


You can also make a `DataFrame` from a list of lists, in which case you should supply the column names using the `columns` parameter.

By default, each list will represent a row of the `DataFrame`:

In [37]:
list_1 = [1, "Ikeja"]
list_2 = [2, "Gbagada"]
list_3 = [3, "Magodo"]

lagos_location_df = pd.DataFrame([list_1, list_2, list_3], columns=["sn", "locations"])

In [38]:
lagos_location_df

Unnamed: 0,sn,locations
0,1,Ikeja
1,2,Gbagada
2,3,Magodo


To create a `DataFrame` from a list of lists where each list is a column, you can use Python's `zip` function to transpose the columns into rows:

In [39]:
data = [
    [1, 2, 3],
    ["Ikeja", "Gbagada", "Magodo"],
    [23000, 40000, 15000]
]

location_census = pd.DataFrame(list(zip(*data)), columns=["id", "location", "population"])
location_census

Unnamed: 0,id,location,population
0,1,Ikeja,23000
1,2,Gbagada,40000
2,3,Magodo,15000
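
To make the `zip(*data)` step less opaque, the sketch below prints the intermediate rows it produces. It also shows one possible alternative, which is an assumption of this example rather than part of the lesson: pairing each column name with its list to build a dictionary of columns.

```python
import pandas as pd

data = [
    [1, 2, 3],
    ["Ikeja", "Gbagada", "Magodo"],
    [23000, 40000, 15000],
]
columns = ["id", "location", "population"]

# zip(*data) takes the first item of every list, then the second, and so on,
# turning three column lists into three row tuples
rows = list(zip(*data))
print(rows)

from_rows = pd.DataFrame(rows, columns=columns)

# Alternative: map each column name to its list of values
from_dict = pd.DataFrame(dict(zip(columns, data)))
print(from_dict)
```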


### Importing from `CSV`

One of the most common file types to read into Pandas is comma separated values (`CSV`). You can import data from a `CSV` file using the `pd.read_csv` function:

In [40]:
salaries = pd.read_csv('Salaries.csv')

In [41]:
print(type(salaries))

<class 'pandas.core.frame.DataFrame'>


In [42]:
salaries

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,ZV,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,
672,ZW,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,
673,ZX,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,
674,ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,


In [43]:
salaries.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


salaries.tail()

#### Optional Parameters

- **`sep`**: Defines the delimiter to use. The default is `,`, and you can check your `CSV` file to see which should be used 
  
- **`header`**: Indicates the row number to use as column names (0-indexed). Default is `0` (first line), but can be set to `None`, in case you have no column names in your first row.
  
- **`index_col`**: This parameter is used to specify which column should be used as the row index. It can be an integer (column position) or a string (column label).
  
- **`usecols`**: Useful when you want to load only specific columns. Pass a list of column names or numbers.
  
- **`dtype`**: Sets the data type of the columns. Pass a single type to apply it to every column, or a dictionary mapping column names to types to set them individually.
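
The parameters above can be combined in one call. The following sketch reads a small, hypothetical semicolon-delimited snippet from memory with `io.StringIO`, so that it runs without a file on disk. The delimiter and the choice of columns are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk
csv_text = (
    "Id;EmployeeName;BasePay;Year\n"
    "AA;NATHANIEL FORD;167411.18;2011\n"
    "AB;GARY JIMENEZ;155966.02;2011\n"
)

pay = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                                    # the delimiter used in the text
    header=0,                                   # the first line holds the column names
    index_col="Id",                             # use the 'Id' column as the row index
    usecols=["Id", "EmployeeName", "BasePay"],  # skip the 'Year' column
    dtype={"BasePay": "float64"},               # set the type of one column by name
)
print(pay)
```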
#### Handling Common `CSV` Loading Issues

A number of factors can affect the proper reading of `CSV` files. Here are some example issues and how to solve them:


- **Different Delimiters:** `CSV` files may use delimiters other than commas (like tabs or semicolons). This will cause the DataFrame to be formatted incorrectly, or to throw an error during loading. Use the `sep` parameter to specify the delimiter, e.g. `pd.read_csv('file.csv', sep='\t')` for tab-delimited files.
  
- **Missing Headers:** If a `CSV` file doesn’t have a header row, set `header=None` to prevent the first row from being treated as column names. You can then assign column names using the `names` parameter.
  
- **Encoding Issues:** `CSV` files can have different encodings (like `UTF-8`, `Latin1`). If you encounter encoding errors, use the `encoding` parameter, e.g., `pd.read_csv('file.csv', encoding='latin1')`.
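
As a rough sketch of these fixes, the code below reads tab-delimited text that has no header row, and then some bytes encoded as Latin-1. Both inputs are made up for the example and held in memory rather than in files.

```python
import io

import pandas as pd

# Tab-delimited text with no header row (made-up content)
raw_text = "1\tIkeja\n2\tGbagada\n3\tMagodo\n"
locations = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",                   # tab delimiter
    header=None,                # the first line is data, not column names
    names=["sn", "location"],   # supply the column names ourselves
)
print(locations)

# Bytes encoded as Latin-1: the accented character is not valid UTF-8 here,
# so the encoding has to be stated explicitly
raw_bytes = "sn,location\n1,Yaoundé\n".encode("latin1")
decoded = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(decoded)
```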
  
### Importing from Excel

Pandas provides the `pd.read_excel()` function to read Excel files. The examples below load a workbook with the default settings, then with the first column as the index (`index_col=0`), and finally from a named sheet (`sheet_name`):




In [44]:
salaries2 = pd.read_excel('Salaries.xlsx')
salaries2.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [47]:
salaries3 = pd.read_excel('Salaries.xlsx', index_col=0)
salaries3.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [50]:
salaries4 = pd.read_excel('Salaries.xlsx', sheet_name="Pora_Staff")
salaries4.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,Sunday,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,Akindoyin,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,Anthony,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,Blessing,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


#### Optional Parameters

- `sheet_name`: Specifies which sheet to read. By default, it's set to `0`, meaning the first sheet. You can set it to the sheet's name or index.
  
- `header`: Identifies the row to use as column names
  
- `index_col`: Selects a column to use as the row index
  
- `usecols`: Limits the import to a list of specific columns
### Outputting `DataFrame` Data to a File
Pandas `DataFrame` objects have various methods for writing data to different file types.

For example, to write data to a `CSV` file, you can use the `to_csv` method. By default the row index is written as the first column; passing `index=False` leaves it out:


In [51]:
salaries4.to_csv('pora_staff.csv')

In [52]:
salaries4.to_csv('pora_staff_1.csv', index=False)
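
To see what `index=False` changes, the sketch below calls `to_csv` without a file path, in which case the `CSV` text is returned as a string. The small `staff` table is a stand-in for this example. Reading the first version back in shows how a saved index reappears as an `Unnamed: 0` column.

```python
import io

import pandas as pd

# A small stand-in table for this example
staff = pd.DataFrame({"Id": ["AA", "AB"], "Year": [2011, 2011]})

with_index = staff.to_csv()                # the index is written as an unnamed first column
without_index = staff.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# Reading the first version back turns the saved index into an extra column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'Id', 'Year']
```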

##  `DataFrame` Copies and Aliases

> It is important to understand that in Pandas, when you assign a `DataFrame` to a new variable, you are creating an **alias**, not a **copy**. This means both variables refer to the same underlying data.

In the code block below, you can see that if you assign a new variable name to a `DataFrame` and then change something in the new variable, you are changing the same underlying `DataFrame` as if you ran the manipulation on the old variable name. This is because both are **aliases** of the same object.

In [53]:
my_dict = {'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]}
my_df = pd.DataFrame(my_dict)

In [54]:
my_df

Unnamed: 0,Animal,Age
0,Dog,2
1,Cat,4
2,Bird,1


In [55]:
new_df = my_df

In [56]:
new_df

Unnamed: 0,Animal,Age
0,Dog,2
1,Cat,4
2,Bird,1


In [57]:
new_df.iloc[0, 0] = 'Pig'

In [58]:
new_df

Unnamed: 0,Animal,Age
0,Pig,2
1,Cat,4
2,Bird,1


In [59]:
my_df

Unnamed: 0,Animal,Age
0,Pig,2
1,Cat,4
2,Bird,1


### Creating a Copy

However, most operations you perform on a `DataFrame` return a new `DataFrame` and leave the original unchanged. For example:

In [60]:
my_new_df = my_df.sort_values(by='Age', ascending=False)
my_new_df.head() # The new DataFrame has been re-sorted

In [61]:
my_df.head() # The original DataFrame has not been changed

Unnamed: 0,Animal,Age
0,Pig,2
1,Cat,4
2,Bird,1


You can also make an unchanged copy of the `DataFrame` using the `.copy()` method:

In [62]:
my_new_df = my_df.copy()
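
One way to check whether two variables are aliases or separate copies is Python's `is` operator. The sketch below starts from a fresh `DataFrame` (the names `alias_df` and `copy_df` are chosen for this example) and shows that editing a copy leaves the original untouched.

```python
import pandas as pd

my_df = pd.DataFrame({"Animal": ["Dog", "Cat", "Bird"], "Age": [2, 4, 1]})

alias_df = my_df        # a second name for the same object
copy_df = my_df.copy()  # a separate object with its own data

print(alias_df is my_df)  # True
print(copy_df is my_df)   # False

# Changing the copy does not affect the original
copy_df.iloc[0, 0] = "Pig"
print(my_df.iloc[0, 0])    # Dog
print(copy_df.iloc[0, 0])  # Pig
```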

### The `inplace` Parameter

> Some `DataFrame` methods accept an `inplace=True` argument. When it is set, the method modifies the original `DataFrame` directly and returns `None`, rather than returning a new `DataFrame` with the changes applied, so there is no need to assign the result to a variable.

Unfortunately, the implementation of this parameter is somewhat inconsistent across the library, so you might need to experiment, or refer to the documentation, to learn which methods will allow you to do this.

In [63]:
my_new_df.drop('Age', axis=1, inplace=True) # Drops the 'Age' column in-place
my_new_df.head()

Unnamed: 0,Animal
0,Pig
1,Cat
2,Bird
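
A common pitfall is assigning the result of an `inplace=True` call, since such a call returns `None`. The sketch below contrasts the two styles on a fresh `DataFrame`. The name `animals` and the other variable names are made up for the example.

```python
import pandas as pd

animals = pd.DataFrame({"Animal": ["Dog", "Cat", "Bird"], "Age": [2, 4, 1]})

# Without inplace, drop() returns a new DataFrame and leaves 'animals' alone
without_age = animals.drop("Age", axis=1)
columns_after_plain_drop = list(animals.columns)
print(columns_after_plain_drop)  # ['Animal', 'Age']

# With inplace=True, 'animals' itself is modified and the call returns None
result = animals.drop("Age", axis=1, inplace=True)
print(result)                 # None
print(list(animals.columns))  # ['Animal']
```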


## Key Takeaways

- Pandas is a Python library for analysing and manipulating large tabular datasets
- Use Pandas inside a Jupyter Notebook for EDA and data cleaning
- A `DataFrame` is a 2D array-like table data structure
- A `Series` is a 1-dimensional array in Pandas that can hold any type of data
- A `Series` with the default integer index can be indexed much like a Python list
- Use `loc` for label-based indexing and `iloc` for position-based indexing
- Pandas allows indexing or slicing of both rows and columns simultaneously using `loc` and `iloc`
- Resetting the index of a `DataFrame` is useful after data manipulations like slicing or filtering
- Pandas allows importing data from various sources like `CSV`, Excel, `JSON`, and SQL into a `DataFrame`
- Pandas can read `CSV` files using `pd.read_csv()`, with optional parameters to handle issues like different delimiters, missing headers, and encoding issues
- The `pd.read_excel()` function reads Excel files, with options to specify sheet, header, index column, and specific columns
- The `to_csv()`, `to_json()` and `to_dict()` methods can be used to export data from a `DataFrame`