# **Pandas: A Comprehensive Guide**

## **Introduction**

Pandas is a powerful Python library for data manipulation and analysis. It provides flexible data structures such as `Series` and `DataFrame` to work efficiently with structured data.

### **Installation**

To install Pandas, use:

```bash
pip install pandas
```

Then import it in Python:

```python
import pandas as pd
```

### **Core Data Structures**

#### **Series**

A `Series` is a one-dimensional labeled array capable of holding any data type.

- ```python
    data = [10, 20, 30, 40]
    s = pd.Series(data, index=['A','B','C','D'])
    print(s)
    ```

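- A `Series` can be built from most list-like data: lists, tuples, NumPy arrays, and dictionaries (the keys become the index). A string is stored as a single scalar value rather than iterated character by character, and a set raises a `TypeError` because it is unordered.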

- ```python
    data = (10, 20, 30, 40)
    s = pd.Series(data, index=['A','B','C','D'])
    print(s)
    ```

- ```python
    import pandas as pd
    import numpy as np

    data = (1,2,3,4,5)
    index = np.array(['A','B','C','D','E'])
    s = pd.Series(data, index)
    print(s)
    ```
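
As a rough sketch of how the constructor treats different inputs (the variable names below are illustrative only): a dictionary supplies its keys as the index, a string is kept as one scalar value, and a set is rejected, so it has to be converted to a list first.

```python
import pandas as pd

# A dictionary's keys become the index labels
from_dict = pd.Series({'A': 10, 'B': 20, 'C': 30})
print(from_dict.index.tolist())  # ['A', 'B', 'C']

# A string is treated as a single scalar, not iterated character by character
from_str = pd.Series('abc')
print(len(from_str))  # 1

# Sets are unordered, so pandas raises TypeError; convert to a list first
try:
    pd.Series({1, 2, 3})
    set_rejected = False
except TypeError:
    set_rejected = True
from_set = pd.Series(sorted({1, 2, 3}))
print(set_rejected)  # True
```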

#### **DataFrame**

A `DataFrame` is a two-dimensional labeled data structure, similar to a table in SQL or Excel.

- Creating a Pandas DataFrame using **list of dictionaries**:
    - Each dictionary represents a row, and the keys become the column labels.

- ```python
    s = pd.DataFrame([{'Reptiles':'scales', 'Birds':'feathers'}, {'Reptiles':'regenerate tail', 'Birds':'does not regenerate'}])
    print(s)
    ```

- Creating a Pandas DataFrame using **dictionary of lists**
    - Each key in the dictionary represents a column label, and the values are lists of column data.

- ```python
    s = pd.DataFrame({'Reptiles':['scales', 'regenerate tail'], 'Birds':['feathers', 'does not regenerate']})
    print(s)
    ```

- Creating a Pandas DataFrame using **list of lists (or Tuples)**
    - Each inner list or tuple represents a row of data, and an optional columns argument can be used to specify the column names.

- ```python
    data = [['a',2],['b',3],['c',4]]

    s = pd.DataFrame(data, columns=['alphabet','numbers'], index=[1,2,3]) # index needs one label per row
    print(s)
    ```

- Creating a Pandas DataFrame using **single dictionary**
    - You can also pass a single dictionary with column names as keys and the corresponding values as lists.

- ```python
    data = {'alphabet':['a','b','c','d'], 'number':[1,2,3,4]}
    s = pd.DataFrame(data)
    print(s)
    ```

- Creating a Pandas DataFrame using **NumPy Array**
    - You can create a DataFrame by passing a NumPy array and defining column labels.

- ```python
    import numpy as np
    import pandas as pd

    # A NumPy array has one dtype, so mixing 'a' and 1 stores every value as a string
    data = np.array([['a',1],['b',2]])
    s = pd.DataFrame(data, columns=['alphabet', 'number'])
    print(s)
    ```

- Creating a Pandas DataFrame from **CSV/Excel files**
    - You can read data from external files (e.g., CSV, Excel) into a DataFrame using Pandas' built-in functions.

- ```python
    data = pd.read_csv('data.csv')
    ```


### **Reading and Writing Data**

- Reading CSV Files
    ```python
    df = pd.read_csv('data.csv')
    print(df.head())
    ```

- Writing to CSV Files
    ```python
    df.to_csv('data.csv', index=False)
    ```
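
The two calls above can be tried without touching the file system. The following is a minimal sketch that assumes a small made-up table and uses an in-memory `io.StringIO` buffer in place of `data.csv`.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({'Name': ['Eric', 'Leah'], 'Salary': [52000, 61000]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out of the output

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.head())
```

Without `index=False`, the row labels would be written as an extra unnamed column and read back as data.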

### **Basic Data Exploration**

#### **Viewing Data**

```python
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
df.info() # Prints a summary (column names, dtypes, non-null counts, memory usage); it returns None, so no print() is needed
print(df.describe()) # Descriptive statistics for numeric columns (count, mean, std, min, quartiles, max)
```

#### **Selecting Columns**

```python
print(df['Column Name']) # Single column; returns a pandas Series
print(df[['Column Name', 'Column Name2']]) # Multiple columns; returns a pandas DataFrame
```
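
The difference between the two return types can be checked directly. This sketch assumes a small made-up DataFrame; the column names are placeholders.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Name': ['Eric', 'Leah', 'Dahlia'],
                   'Age': [28, 35, 41]})

single = df['Age']    # one label -> Series
double = df[['Age']]  # list of labels -> DataFrame, even with a single column

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(double.shape)           # (3, 1)
```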

#### **Selecting Rows**

##### **.loc - Label-based Indexing**
- The `.loc` indexer is used when you want to select data by label (i.e., by row and column names). It *includes* the end point in a range, which is different from the standard Python slicing.

```python
df.loc[row_label, column_label]
```

- row_label: The index label of the row you want to select. Note that this is the label, not the integer position.
- column_label: The label (or name) of the column you want to select.

- **Selecting a specific row by label**:
```python
df.loc[2] # Selects the row with index label 2
```

- **Selecting specific rows and columns**:
```python
df.loc[2, 'Salary'] # Selects the 'Salary' value in row with label 2
df.loc[1:3, ['Name', 'Salary']] # Selects the 'Name' and 'Salary' values for rows with labels 1, 2, 3
```

- **Selecting all rows for specific columns**:
```python
df.loc[:, 'Salary'] # Selects the entire 'Salary' column
df.loc[:, ['Name', 'Salary']] # Selects the entire 'Name' and 'Salary' columns
```

- **Conditional selection**:
```python
df.loc[df['Salary']>50000] # Selects rows where Salary is greater than 50000
df.loc[df['Age']<30, 'Name'] # Selects the 'Name' of employees younger than 30
df.loc[df['Name'].isin(['Eric', 'Leah', 'Dahlia']), 'Age'] # Selects the 'Age' of people whose names are 'Eric', 'Leah', or 'Dahlia'
```

##### **.iloc - Positional Indexing**
- The `.iloc` indexer is used for integer-location based indexing. It selects rows and columns by their integer positions, similar to how you would slice a list (excludes the last index).

```python
df.iloc[row_index, column_index]
```

- row_index: The index position (integer) of the row you want to select
- column_index: The index position (integer) of the column you want to select

- **Selecting a specific row by position**:
```python
df.iloc[2] # Selects the third row (position 2)
```

- **Selecting specific rows and columns by position**:
```python
df.iloc[1:3, 0:2] # Selects rows at position 1 and 2, and columns at positions 0 and 1
```

- **Selecting all rows for specific columns by position**:
```python
df.iloc[:, 1] # Selects the second column (position 1)
df.iloc[:, [0,2]] # Selects the first and third columns (position 0 and 2)
```

- **Conditional selection using `.iloc` along with `.values`**:
- When using `.iloc` with a boolean condition, you need to convert the condition to a numpy array of booleans (e.g., using `.values`), so it can work with the positional indexing.

```python
df.iloc[df['Salary'].values > 50000] # Select rows where Salary > 50000 by position
```
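
The contrast between `.loc` and `.iloc` is easiest to see when the index labels differ from the positions. The sketch below assumes a made-up employee table with labels 10, 20, 30, 40.

```python
import pandas as pd

# Hypothetical employee table whose index labels differ from the positions 0..3
df = pd.DataFrame(
    {'Name': ['Eric', 'Leah', 'Dahlia', 'Omar'],
     'Salary': [48000, 52000, 61000, 45000]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30, 'Name']  # labels 10, 20, 30 (end label included)
by_position = df.iloc[0:2, 0]     # positions 0 and 1 (end position excluded)

# A boolean Series works directly with .loc; .iloc needs a plain boolean array
mask = df['Salary'] > 50000
high_loc = df.loc[mask]
high_iloc = df.iloc[mask.to_numpy()]

print(by_label.tolist())           # ['Eric', 'Leah', 'Dahlia']
print(by_position.tolist())        # ['Eric', 'Leah']
print(high_loc.equals(high_iloc))  # True
```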

#### **More Advanced Use Cases**

- **Slicing rows and columns using `.loc`**
```python
df.loc[1:4, 'Name':'Salary'] # Selects rows from index 1 to 4, columns from 'Name' to 'Salary'
```

- **Selecting rows based on boolean conditions**
```python
df.loc[df['Age'] > 30, ['Name', 'Salary']] # Selects the 'Name' and 'Salary' for rows where 'Age' > 30
```

- **Modifying Values using `.loc`**
```python
df.loc[df['Age'] > 30, 'Salary'] = 60000 # Updates the 'Salary' of employees older than 30
```

- **Using `.iloc` to select a block of rows and columns**
```python
df.iloc[0:5, 1:3] # Selects the first five rows (positions 0 to 4) and the columns at positions 1 and 2
```

### **Data Cleaning**

- **Handling Missing Values**
```python
df.dropna(subset=['Salary', 'Age'], inplace=True) # Remove rows where any of these columns have missing values
```
```python
# Fills every missing value in the DataFrame with 'T-rex'.
# Each column holds a single dtype, so filling a numeric column with a string converts that column to 'object'.
df.fillna('T-rex', inplace=True)
```
```python
df['Salary'] = df['Salary'].fillna(100000) # Fills the missing values of the 'Salary' column with 100000; assigning the result back avoids an in-place update on a column selection
```
- **Rename Columns**
- **Changing Data Types**
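
The last two items can be illustrated with `rename()` and `astype()`. This is a minimal sketch on made-up data; the column names and values are assumptions for the example.

```python
import pandas as pd

# Hypothetical data where the salary was read in as text
df = pd.DataFrame({'name': ['Eric', 'Leah'], 'salary': ['52000', '61000']})

# rename() returns a new DataFrame; the mapping is old label -> new label
df = df.rename(columns={'name': 'Name', 'salary': 'Salary'})

# astype() converts a column to the requested dtype and raises if a value cannot be converted
df['Salary'] = df['Salary'].astype('int64')

print(df.columns.tolist())  # ['Name', 'Salary']
print(df['Salary'].dtype)   # int64
```

When some values may not be convertible, `pd.to_numeric(..., errors='coerce')` turns them into `NaN` instead of raising.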


### **Data Transformation**

- **Using ```.apply()```**
    - In pandas, the ```.apply()``` method is a powerful way to apply a function to each element in a Series or to each row or column in a DataFrame.

    ```python
    df['Name'] = df['Name'].apply(lambda x: x.upper()) # Converts all names to uppercase
    ```

    ```python
    df['Full Name'] = df.apply(lambda row: row['First Name'] + ' ' + row['Last Name'], axis=1) # Combines two columns into a new one by applying a function to each row; axis=1 passes one row at a time to the function
    ```
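
For simple cases like these, built-in vectorized operations usually give the same result as `.apply()` with less code. The sketch below compares both on made-up names.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'First Name': ['eric', 'leah'], 'Last Name': ['stone', 'park']})

# Element-wise apply on a Series, and the equivalent string accessor
upper_apply = df['First Name'].apply(lambda x: x.upper())
upper_str = df['First Name'].str.upper()

# Row-wise apply, and the equivalent column-level concatenation
full_apply = df.apply(lambda row: row['First Name'] + ' ' + row['Last Name'], axis=1)
full_vector = df['First Name'] + ' ' + df['Last Name']

print(upper_apply.tolist())                         # ['ERIC', 'LEAH']
print(upper_apply.tolist() == upper_str.tolist())   # True
print(full_apply.tolist() == full_vector.tolist())  # True
```

`.apply()` remains useful when the logic cannot be expressed with existing column operations.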

### **.groupby()**

- ```df.groupby([...])``` tells pandas to split the DataFrame into groups based on column values. Each group contains the rows that share the same values in the specified columns.
```python
df.groupby(['School', 'Major', 'Residence'])
```
- In the example above, pandas groups the data by the combination of the columns. For instance, all rows with ('McGill', 'Economics', 'Gardener') are grouped into one whereas all rows with ('UT Austin', 'Data Science', 'Non-resident') are grouped into another. 
- It returns a DataFrameGroupBy object - not a DataFrame, but an iterable object that represents groups of rows.
    - We can think of this as a dictionary-like structure where:
        - Each **key** is a tuple of group values like ('McGill', 'Economics', 'Gardener')
        - Each **value** is a mini DataFrame containing the rows that match those values
- ```.groupby([...])``` is typically used for:
    - Aggregation: ```.sum()```, ```.mean()```, ```.count()```
    - Transformation: ```.transform()``` to modify values but keep original shape
    - Custom operations: ```.apply()``` to do anything we want with each group
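
A minimal sketch of these ideas, assuming a made-up student table (the `Tuition` column and its values are invented for illustration, and only two grouping columns are used to keep it short):

```python
import pandas as pd

# Hypothetical student records
df = pd.DataFrame({
    'School': ['McGill', 'McGill', 'UT Austin', 'UT Austin'],
    'Major': ['Economics', 'Economics', 'Data Science', 'Data Science'],
    'Tuition': [10000, 12000, 30000, 34000],
})

grouped = df.groupby(['School', 'Major'])

# Iterating yields (key, sub-DataFrame) pairs; the key is a tuple of group values
group_sizes = {key: len(rows) for key, rows in grouped}

# Aggregation: one result per group
mean_tuition = grouped['Tuition'].mean()

# Transformation: same length as the original, each row receives its group's mean
df['Group Mean'] = grouped['Tuition'].transform('mean')

print(group_sizes)
print(mean_tuition.loc[('McGill', 'Economics')])  # 11000.0
print(df['Group Mean'].tolist())  # [11000.0, 11000.0, 32000.0, 32000.0]
```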

## **Pandas Series Exercises**

In [5]:
import pandas as pd
import numpy as np

In [6]:
# 1. Write a Pandas program to create and display a one-dimensional array-like object containing an array of data using Pandas module.

data = [1,2,3,4,5] # avoid naming a variable `list`, which would shadow the built-in type
series1 = pd.Series(data)
print(series1)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [7]:
# 2. Write a Pandas program to convert a Pandas Series to a Python list and show its type.

series1 = pd.Series([1,2,3,4,5])
list1 = series1.to_list()
print(list1)
print(type(list1))

[1, 2, 3, 4, 5]
<class 'list'>


In [8]:
# 3. Write a Pandas program to add, subtract, multiply and divide two Pandas Series.

series1 = pd.Series([2, 4, 6, 8, 10])
series2 = pd.Series([1, 3, 5, 7, 9])

add = series1 + series2
subtract = series1 - series2
multiply = series1 * series2
divide = series1 / series2
print(add, subtract, multiply, divide, sep='\n') # sep='\n' starts each result on its own line

0     3
1     7
2    11
3    15
4    19
dtype: int64
0    1
1    1
2    1
3    1
4    1
dtype: int64
0     2
1    12
2    30
3    56
4    90
dtype: int64
0    2.000000
1    1.333333
2    1.200000
3    1.142857
4    1.111111
dtype: float64


In [9]:
# 4. Write a Pandas program to compare the elements of the two Pandas Series.

series1 = pd.Series([2, 4, 6, 8, 10])
series2 = pd.Series([1, 3, 5, 7, 10])

for i in range(len(series1)):
  if series1[i]>series2[i]:
    print('series 1 is greater than series 2')
  elif series1[i] == series2[i]:
    print('series 1 is equal to series 2')
  else:
    print('series 2 is greater than series 1')

series 1 is greater than series 2
series 1 is greater than series 2
series 1 is greater than series 2
series 1 is greater than series 2
series 1 is equal to series 2
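
As an alternative sketch to the loop, the comparison operators can be applied to the whole Series at once. Each one returns a boolean Series aligned on the index.

```python
import pandas as pd

series1 = pd.Series([2, 4, 6, 8, 10])
series2 = pd.Series([1, 3, 5, 7, 10])

# Element-wise comparisons, no explicit loop needed
greater = series1 > series2
equal = series1 == series2

print(greater.tolist())  # [True, True, True, True, False]
print(equal.tolist())    # [False, False, False, False, True]
```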


In [10]:
# 5. Write a Pandas program to convert a dictionary to a Pandas series.

data = {'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 800} # avoid naming a variable `dict`, which would shadow the built-in type
series1 = pd.Series(data)
print(series1)

a    100
b    200
c    300
d    400
e    800
dtype: int64


In [11]:
# 6. Write a Pandas program to convert a NumPy array to a Pandas series.

array1 = np.array([10,20,30,40,50])
series1 = pd.Series(array1)
print(series1)

0    10
1    20
2    30
3    40
4    50
dtype: int32


In [12]:
# 7. Write a Pandas program to change the data type of a given column or Series.

# series1.astype('float') will not work because 'python' cannot be converted to a number
# to_numeric() is specific to pandas only

series1 = pd.Series(['1','2', 'python'])
series1 = pd.to_numeric(series1, errors='coerce') #coerce will return NaN for values that cannot be converted

print(series1)

0    1.0
1    2.0
2    NaN
dtype: float64


In [13]:
# 8. Write a Pandas program to convert the first column of a DataFrame as a Series.

d = {'col1': [1, 2, 3, 4, 7, 11], 'col2': [4, 5, 6, 9, 5, 0], 'col3': [7, 5, 8, 12, 1, 11]}

df = pd.DataFrame(d)
series1 = df['col1']
print(series1)

0     1
1     2
2     3
3     4
4     7
5    11
Name: col1, dtype: int64


In [14]:
# 9. Write a Pandas program to convert a given Series to an array.

# Series.values returns a NumPy ndarray (Series.to_numpy() is the recommended equivalent)
series1 = pd.Series([1,2,3,4,5])
array = series1.values
print(array)

[1 2 3 4 5]


In [15]:
# 10. Write a Pandas program to convert Series of lists to one Series.

s = pd.Series([
    ['Red', 'Green', 'White'],
    ['Red', 'Black'],
    ['Yellow']])

new_list = []

for element in s:
  for item in element:
    new_list.append(item)
sa = pd.Series(new_list)
print(sa)

0       Red
1     Green
2     White
3       Red
4     Black
5    Yellow
dtype: object
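
A shorter alternative sketch uses `Series.explode()`, which gives each list element its own row, followed by `reset_index(drop=True)` to renumber the result from 0.

```python
import pandas as pd

s = pd.Series([['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']])

# explode() repeats the original index label for every element of a list,
# so the index is reset afterwards to get 0..5
flat = s.explode().reset_index(drop=True)

print(flat.tolist())  # ['Red', 'Green', 'White', 'Red', 'Black', 'Yellow']
```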


In [16]:
# 11. Write a Pandas program to sort a given Series.

series1 = pd.Series([10,3,6,4,5])
series1 = series1.sort_values(ascending = False, ignore_index = True)
print(series1)

0    10
1     6
2     5
3     4
4     3
dtype: int64


In [17]:
# 12. Write a Pandas program to add some data to an existing Series.

s = pd.Series(['100', '200', 'python', '300.12', '400'])
s1 = pd.Series(['microsoft', 'google'])
new = pd.concat([s,s1], axis = 0, ignore_index = True)

print(new)

0          100
1          200
2       python
3       300.12
4          400
5    microsoft
6       google
dtype: object


In [18]:
# 13. Write a Pandas program to create a subset of a given series based on value and condition.

s = pd.Series([100,200,300,400])
new = s[s<200]
print(new)

0    100
dtype: int64


In [19]:
# 14. Write a Pandas program to change the order of index of a given series.

series1 = pd.Series(['meta', 'microsoft', 'google'])
new_series = pd.Series(series1.to_list(), index=[2,1,0])
print(new_series)

2         meta
1    microsoft
0       google
dtype: object


In [20]:
# 15. Write a Pandas program to create the mean and standard deviation of the data of a given Series.

series1 = pd.Series([1,2,3,4,5,6,7,8,9,10])
mean = series1.mean()
sd = series1.std()

print(mean, round(sd,2))

5.5 3.03


In [21]:
# 16. Write a Pandas program to get the items of a given series not present in another given series.

series1 = pd.Series([1,2,3,4,5])
series2 = pd.Series([4,5,6,7,8])

print(series1[~series1.isin(series2)])

0    1
1    2
2    3
dtype: int64


In [22]:
# 17. Write a Pandas program to get the items which are not common of two given series.

series1 = pd.Series([1,2,3,4,5,6,7,8,9,10])
series2 = pd.Series([8,9,10,11,12,13,14,15])

new1 = series1[~series1.isin(series2)]
new2 = series2[~series2.isin(series1)]

new = pd.concat([new1, new2], ignore_index = True)

print(new)

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7     11
8     12
9     13
10    14
11    15
dtype: int64


In [23]:
# 18. Write a Pandas program to compute the minimum, 25th percentile, median, 75th percentile, and maximum of a given series.

series1 = pd.Series([1,2,3,4,5,6,7,8,9,10])
min_num = series1.min()
twentyfifth = series1.quantile(q=0.25)
median = series1.median()
seventyfifth = series1.quantile(q=0.75)
max_num = series1.max()

print(min_num, twentyfifth, median, seventyfifth, max_num)

1 3.25 5.5 7.75 10


In [24]:
# 19. Write a Pandas program to calculate the frequency counts of each unique value of a given series.

series1 = pd.Series([1,1,1,2,3,4,5,6,6,7,7,7,7,7])
print(series1.value_counts())

7    5
1    3
6    2
2    1
3    1
4    1
5    1
Name: count, dtype: int64


In [25]:
# 20. Write a Pandas program to display most frequent value in a given series and replace everything else as 'Other' in the series.

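num_series = pd.Series(np.random.randint(1, 5, [15]))
count = num_series.value_counts() # value_counts returns a Series sorted by frequency, most frequent first
most_frequent = count.index[0]

# Cast to object first so the Series can hold both integers and the string 'Other'
num_series = num_series.astype(object)
num_series[num_series!=most_frequent] = 'Other'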

print(num_series)

0         4
1     Other
2     Other
3     Other
4     Other
5         4
6         4
7         4
8     Other
9         4
10    Other
11        4
12    Other
13    Other
14        4
dtype: object



In [26]:
# 21. Write a Pandas program to find the positions of numbers that are multiples of 5 of a given series.

num_series = pd.Series(np.random.randint(1, 10, 9))
num_series = num_series[num_series%5==0]

print(num_series.index)

Index([0, 7], dtype='int64')


In [27]:
# 22. Write a Pandas program to extract items at given positions of a given series.

series1 = pd.Series([1,2,3,4,4,3,1,2,3,5,4,3,3,5,2,3,4,4,6,4,2,1])
series1 = pd.to_numeric(series1, downcast = 'integer')
positions = [10, 15]

print(series1.take(positions)) # take() selects by zero-based integer position, not by index label; the two coincide here only because the index is the default 0, 1, 2, ...

10    4
15    3
dtype: int8
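
The distinction between position and label only becomes visible with a non-default index. The sketch below assumes a small made-up Series whose labels run in reverse order.

```python
import pandas as pd

# Hypothetical Series: labels 2, 1, 0 are attached to positions 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_position = s.take([0])  # first item in storage order
by_label = s.loc[[0]]      # item whose index label is 0

print(by_position.tolist())  # [10]
print(by_label.tolist())     # [30]
```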
