# Pandas
Pandas is an open-source data analysis and manipulation
library for Python. It provides data structures for efficiently
storing and manipulating large datasets, as well as tools for
reading and writing data from various sources. We'll go over
the basics of pandas and introduce some of its most important
functions.

# 1. Pandas Series
A pandas Series is a one-dimensional labeled array. Simply
put, a Series has one column of data and an index. In pandas,
we can create one using the Series constructor.

In [6]:
import pandas as pd

# create a Series

fruits = pd.Series(["Orange", "Apple", "Mango"], name="fruits")
fruits

0    Orange
1     Apple
2     Mango
Name: fruits, dtype: object

In the above code, we have made a one-dimensional array
using pandas. We passed the list ["Orange", "Apple", "Mango"]
as an argument to the constructor. The output on the far left
(0, 1, 2) is the index of the Series, and on the right is the
column of data. By default, the index starts at 0 and runs to
n - 1, where n is the number of items.

## 1.1 Series Index and Names
We can also specify the values of the index using the index
parameter. The index must be the same length as the data in
the Series, and it does not have to be unique (you could even
pass the same label for every item). In the example below, we
specify the index values, and you can see in the output that
the index is now (a, b, c). We could have passed (c, c, c) as
index values, and it would still be valid. However, it is good
practice to keep index values unique.

In [7]:
fruits = pd.Series(["Orange", "Apple", "Mango"], name="fruits", index = ['a', 'b', 'c'])
fruits

a    Orange
b     Apple
c     Mango
Name: fruits, dtype: object


The name of the Series is "fruits". This is the value that we
passed to the name parameter. The name parameter is
optional.
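
To make the point about non-unique index values concrete, here is a minimal sketch. The repeated label "a" and the variable names are illustrative assumptions, not part of the example above. It shows that a repeated label returns every matching item rather than a single value.

```python
import pandas as pd

# Hypothetical data: the label "a" is deliberately used twice.
fruits = pd.Series(
    ["Orange", "Apple", "Mango"],
    index=["a", "a", "b"],
    name="fruits",
)

# A label that appears once returns a single value.
single = fruits["b"]

# A label that appears more than once returns a Series with every match.
repeated = fruits["a"]

print(single)
print(repeated)

# is_unique reports whether any label is repeated.
print(fruits.index.is_unique)
```

Because a lookup can return either one value or several, code that relies on a single result is easier to reason about when the index is unique.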

## 1.2 Series Data Type
The dtype in the Series output means "data type." By default,
the data type is inferred from the data in the Series. The
items in our list above are strings, so the inferred data type
is object. If the data is a mixture of different data types,
e.g., strings and integers, the data type will also be
"object." Let's see what happens when we create a Series of
integers.

In [8]:
int_numbers = pd.Series([10, 20, 30], name="numbers")
int_numbers

0    10
1    20
2    30
Name: numbers, dtype: int64

You can see from the output that the data type is now int64
(a 64-bit integer). This is because the data in our Series
consists of integers. What happens when we mix integers with
strings? You can see below that the dtype has changed to
"object." By default, heterogeneous data is stored as object
data.

In [9]:
int_str = pd.Series([10, 20, 30, 'Orange'], name="mixed")

int_str

0        10
1        20
2        30
3    Orange
Name: mixed, dtype: object

You can also specify the data type of the Series. The Series
constructor has a dtype parameter where we can pass a data
type as an argument.

In [10]:
int_str = pd.Series([10, 20, 30], name="mixed", dtype="int8")

int_str

0    10
1    20
2    30
Name: mixed, dtype: int8

It is also possible to change the data type of a series after it
is created. This can be done using the astype() method.

In [11]:
num_series = pd.Series([10, 20, 30, 40])
num_series

0    10
1    20
2    30
3    40
dtype: int64

Now let's change the dtype of the Series to float64:

In [12]:
num_series = num_series.astype("float64")
num_series

0    10.0
1    20.0
2    30.0
3    40.0
dtype: float64
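
The same astype() call can also convert strings that hold numbers. The sketch below assumes a hypothetical Series of numeric strings (not taken from the examples above) and shows that astype() returns a new Series rather than modifying the original.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings.
str_numbers = pd.Series(["10", "20", "30"])

# astype() returns a new Series; str_numbers itself is left unchanged.
as_int = str_numbers.astype("int64")

print(as_int.dtype)
print(as_int.sum())
```

If any string cannot be parsed as an integer, astype() raises an error, so this conversion is best applied to data that is already clean.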

# 2. Creating a Pandas DataFrame

To create a pandas DataFrame, you can pass a dictionary
or a list of lists to the DataFrame constructor. With a
dictionary, the keys become the column names, and the
optional columns parameter selects and orders them. With a
list of lists, the column names are passed through the
columns parameter:


In [13]:
data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'],
         'Age': [25, 32, 18, 47], 
         'Salary': [50000, 80000, 20000, 120000]}
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   32   80000
2  Charlie   18   20000
3    David   47  120000
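
The list-of-lists form mentioned above can be sketched as follows. Each inner list is one row, and the column names are supplied through the columns parameter. The variable name rows is an assumption made for illustration.

```python
import pandas as pd

# Each inner list is one row of the table.
rows = [
    ["Alice", 25, 50000],
    ["Bob", 32, 80000],
    ["Charlie", 18, 20000],
    ["David", 47, 120000],
]

# With a list of lists, column names come from the columns parameter.
df = pd.DataFrame(rows, columns=["Name", "Age", "Salary"])

print(df)
```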


You can also create a DataFrame from separate lists by
combining them into a dictionary and passing it to the
DataFrame constructor.

In [14]:
names = ['Alice', 'Bob', 'Charlie', 'David']
age = [25, 32, 18, 47]
salary = [50000, 80000, 20000, 120000]

df = pd.DataFrame({"Names": names, "Age": age, "Salary": salary})
df

     Names  Age  Salary
0    Alice   25   50000
1      Bob   32   80000
2  Charlie   18   20000
3    David   47  120000


# 3. Data Loading Functions
Pandas provides several functions for loading data from
various sources:

## 3.1 read_csv()
This function reads data from a CSV file into a pandas
DataFrame. Since much tabular data is saved in this format,
it is one of the most commonly used functions in pandas. It
provides options to specify the delimiter, encoding, header
row, and more. Let's say you have a file called "Data" saved
in CSV format. You would open it by passing the file's path
to the read_csv() function.
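
As a small illustration of the delimiter and header options, the sketch below reads semicolon-separated text. To stay self-contained it uses an in-memory io.StringIO buffer in place of a file on disk. The sample text and column names are assumptions made for the example.

```python
import io

import pandas as pd

# Stand-in for a file on disk: hypothetical semicolon-separated text.
csv_text = "Name;Age\nAlice;25\nBob;32\n"

# sep sets the delimiter; header=0 means the first row holds column names.
df = pd.read_csv(io.StringIO(csv_text), sep=";", header=0)

print(df)
```

With a real file, the first argument would simply be the path to that file.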

## 3.2 read_excel()
You can also use the pandas read_excel() function to read
data from an Excel file into a pandas DataFrame. It provides
options to specify the sheet name, header row, and more.

## 3.3 read_sql()
This function is used to read data from a SQL database into
a pandas DataFrame. It requires a connection to the
database and a SQL query.
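
A minimal sketch of read_sql(), assuming an in-memory SQLite database created with Python's built-in sqlite3 module. The players table and its contents are hypothetical and exist only for this example.

```python
import sqlite3

import pandas as pd

# Build a small in-memory database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, goals INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Alex", 4), ("Bob", 0), ("Eve", 8)],
)
conn.commit()

# read_sql() takes the SQL query and the open connection.
df = pd.read_sql(
    "SELECT name, goals FROM players WHERE goals > 0 ORDER BY name",
    conn,
)
conn.close()

print(df)
```

For other database systems, the connection would typically come from that database's driver or from SQLAlchemy instead of sqlite3.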

In [21]:
# Reading data from a CSV file

df = pd.read_csv("sports_data.csv")

df

      Name       Sport  Goals  Assists  Fouls  Minutes Played  Yellow  Red Cards  Team
0     Alex  Basketball      4        5      8              36       1          0     A
1      Bob      Soccer      0        2      1              31       0          0     B
2  Charlie  Basketball      2        2      6              34       1          0     A
3    David      Soccer      1        7      6               9       0          0     B
4      Eve  Basketball      8        2      6              29       0          0     A
5    Frank      Soccer      0        6      5              35       1          0     B
6   George  Basketball      5        8      8               3       1          0     A
7    Harry      Soccer      9        6      8              13       1          0     B
8     Ivan  Basketball      7        9      8              30       1          0     A
9     Jack      Soccer      8        1      3              58       1          0     B


In [31]:
# Reading the data from Excel file

#df = pd.read_excel("Names.xlsx")
#df

# 4. Data Cleaning
Data comes in many forms. Before we can start analyzing
our data, we need to clean it up. Here are some of the most
important functions for data cleaning:
1. .dropna()
2. .fillna()

## 4.1 .dropna()
If we have NaN values in the DataFrame, we can drop them
using the df.dropna() method. Let’s create a DataFrame
with missing values and use the dropna() method to drop
the missing values.

In [None]:
names = ["Alice", "Bob", "Charlie", "David"]
age = [25, None, 18, 47]
Salary = [50000, 80000, None, 120000]

df = pd.DataFrame({"Names": names, "Age": age, "Salary": Salary})
df

In [33]:
# The NaN values in the DataFrame represent missing values. Now,
# to drop all rows with missing values with the dropna() method, here is how we do it:
df.dropna()


   Names   Age    Salary
0  Alice  25.0   50000.0
3  David  47.0  120000.0


We can also drop all columns with missing values. Columns
are on axis 1, so pass 1 to the axis parameter. This will drop
all the columns except for the "Names" column.

In [34]:
df.dropna(axis=1)

     Names
0    Alice
1      Bob
2  Charlie
3    David
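
Dropping every row or column that contains a missing value can discard a lot of data. As a sketch of a narrower approach, the subset parameter of dropna() limits the check to the columns you name. The example rebuilds the same DataFrame so it can run on its own. The variable name age_known is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Names": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 18, 47],
    "Salary": [50000, 80000, None, 120000],
})

# Only rows with a missing "Age" are dropped; Charlie is kept even
# though his salary is missing.
age_known = df.dropna(subset=["Age"])

print(age_known)

# dropna() returns a new DataFrame, so df still has all four rows.
print(len(df))
```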


## 4.2 .fillna()
This method fills missing values in the DataFrame with a
specified value. Recent pandas versions handle forward and
backward filling through the separate ffill() and bfill()
methods.
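
A minimal sketch of fillna(), reusing the DataFrame with missing values from the dropna() section. The choice of fill values (0, the column mean, and the column median) is an assumption made for illustration, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "Names": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 18, 47],
    "Salary": [50000, 80000, None, 120000],
})

# Replace every missing value with the same constant.
filled_zero = df.fillna(0)

# Replace missing values column by column by passing a dictionary.
filled_stats = df.fillna({
    "Age": df["Age"].mean(),
    "Salary": df["Salary"].median(),
})

print(filled_zero)
print(filled_stats)
```

Like dropna(), fillna() returns a new DataFrame and leaves df unchanged unless you assign the result back.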

# 5. Data Manipulation in Pandas
Once data is loaded into a pandas DataFrame, there are
several methods for manipulating and transforming the
data.

1. head()
2. tail()
3. info()
4. describe()
5. groupby()
6. merge()
   

## 5.1 head()
This method returns the first n rows of a DataFrame. By
default, it returns the first five rows.

In [36]:
names = ['Alice', 'Bob', 'Charlie', 'David', 'John', 'Mpho', 'Steve', 'ben']
age = [25, 29, 33, 21, 57, 66, 50, 30]

# Creating a DataFrame
df = pd.DataFrame({"Names": names, "Age": age})

# View the first 5 rows

df.head()

     Names  Age
0    Alice   25
1      Bob   29
2  Charlie   33
3    David   21
4     John   57


In [38]:
# View only the first 2 rows

df = pd.DataFrame({"Names": names, "Age": age})

df.head(2)

   Names  Age
0  Alice   25
1    Bob   29


## 5.2 tail()
The tail() method returns the last n rows of a DataFrame.
By default, it returns the last 5 rows.

In [39]:
df.tail()


   Names  Age
3  David   21
4   John   57
5   Mpho   66
6  Steve   50
7    ben   30


In [40]:
# View only the last 3 rows

df.tail(3)

   Names  Age
5   Mpho   66
6  Steve   50
7    ben   30


## 5.3 info()
The info() method provides a summary of the DataFrame,
including the data types, number of non-null values, and
memory usage. You can see below that our DataFrame is
using about 260 bytes of memory.

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Names   8 non-null      object
 1   Age     8 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 260.0+ bytes


## 5.4 describe()
If we want summary statistics for each numeric column in
the DataFrame, such as count, mean, standard deviation,
minimum, and maximum, we can use the describe()
method. For example, the output below shows that the
average age of the people is about 38.9.

In [42]:
df.describe()

             Age
count   8.000000
mean   38.875000
std    16.522171
min    21.000000
25%    28.000000
50%    31.500000
75%    51.750000
max    66.000000


In [43]:
## 5.5 groupby()
