# **Chapter 1: Basic Python Usage for Data Exploration & Manipulation**

###  **Introduction**

This chapter explores some of the basic, everyday functions that I use in my role as a data analyst. This framework is time-proven and effective for exploring and manipulating datasets. The following tools, libraries, and methodologies will get you started performing exploratory data analysis on live datasets.

## **1.1: Basic Import Statements**

```python
# Pandas - For Data Manipulation
import pandas as pd 
# Numpy - For Linear Algebra and Other Mathematics
import numpy as np 
# Matplotlib - For plotting graphs. 
import matplotlib.pyplot as plt 
```

## **1.2: Pandas DataFrame - Basics**

### **Creating a DataFrame**

This guide covers the most common operations you'll use with pandas DataFrames.
```python
# Import Statement for Pandas:
import pandas as pd

# Creating data that we will convert to a 
# Pandas DataFrame object.
data = {
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
    "Credits": [45, 60, 30]
}

df = pd.DataFrame(data)

# Displaying the df
display(df)
```

### **Inspecting Data**

```python
# View the first 5 rows
display(df.head())

# .head() defaults to 5 rows.
# To view the first n rows, pass the number of rows inside the parentheses.
# For example, to view the first 7 rows of the DataFrame:
display(df.head(7))

# Get basic info about the DataFrame (non-null counts and the data type of each column).
# .info() prints its report directly and returns None, so it is not wrapped in print().
df.info()

# Summary statistics
print(df.describe())
```
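As a complement to `.info()`, the sketch below looks at column data types and missing-value counts directly. The DataFrame is re-created so the block runs on its own, and the missing GPA is a hypothetical value added only for illustration.

```python
import pandas as pd

# Hypothetical sample data; the missing GPA is added only for illustration.
students = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, None, 3.2],
    "Credits": [45, 60, 30],
})

# Data type of each column
dtypes = students.dtypes
print(dtypes)

# Number of missing values in each column
missing_counts = students.isna().sum()
print(missing_counts)
```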


### **Selecting Data**

```python
# Select a column
print(df["GPA"])

# Select multiple columns
print(df[["Student", "Credits"]])
```

### **Selecting Data with iloc**

```python
# .iloc is not a column or a stored attribute of the DataFrame — it is an 
# indexer object that lets you select data by integer position (row and 
# column numbers).

# You always call it with square brackets [], and you can pass it one 
# or two arguments:

# General form (row_index and column_index are placeholders, not defined variables):
# df.iloc[row_index, column_index]

```

- Row index (first argument):

    - Single integer → returns a row.

    - Slice (`0:3`) → returns multiple rows.

    - List of integers (`[0, 2]`) → returns multiple rows.

- Column index (second argument, optional):

    - Works the same way as row index but applies to columns.

    - If you leave it out, you get the whole row.

**Examples:**

```python
print(df.iloc[0])               # First row (row index 0)
print(df.iloc[0, 1])            # First row, second column (integer positions)
print(df.iloc[0:2, 0:2])        # First two rows and first two columns
print(df.iloc[[0, 2], [1, 2]])  # Specific rows and columns
``` 

A quick recap of `iloc` selection on pandas DataFrame objects:

```python
# Select by integer location (iloc)
# iloc uses "integer-location based indexing"
# It only works with row/column numbers (not labels).
# Remember: Python uses 0-based indexing.
print(df.iloc[0])        # Entire first row (row index 0)
print(df.iloc[0, 1])     # Value at first row, second column (GPA of Alice)
```


### **Select by label with loc**


```python
# loc uses labels instead of integer positions.
print(df.loc[0, "Student"])  # Student name in row with index label 0

```
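The difference between `loc` and `iloc` is easier to see when the index labels are not the default integers. The sketch below assumes the student names are used as the index, which is a choice made only for this illustration.

```python
import pandas as pd

# Same sample values as above, but with student names as index labels
# (an assumption made for this illustration).
df_named = pd.DataFrame(
    {"GPA": [3.5, 3.8, 3.2], "Credits": [45, 60, 30]},
    index=["Alice", "Ben", "Carlos"],
)

# loc selects by label
ben_gpa = df_named.loc["Ben", "GPA"]

# iloc selects by integer position (row 1, column 0)
same_value = df_named.iloc[1, 0]

# Label slices with loc include the end label;
# position slices with iloc exclude the end position.
by_label = df_named.loc["Alice":"Ben"]   # 2 rows
by_position = df_named.iloc[0:1]         # 1 row

print(ben_gpa, same_value)
print(len(by_label), len(by_position))
```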

### **Filtering Data**

If we want to filter a Pandas DataFrame on a condition, we must first define that condition. For example, let's say that we want to filter a DataFrame of students down to only students that have a GPA greater than 3.5. 

The condition, in pseudocode, would be `df[GPA] > 3.5`. In Python, we can define it as follows:

```python
condition = df['GPA'] > 3.5
```
Now, to filter the DataFrame by that condition, we can simply place it inside square brackets on the DataFrame, just as we would a column name. Pandas will then return only the rows where the condition is true:
```python
# Displaying the rows with a GPA > 3.5
display(df[condition])
```
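It can help to look at what the condition itself contains. As the sketch below shows, it is a Series of `True`/`False` values, one per row, and pandas keeps the rows marked `True`. The name `df_demo` is a stand-in so the block runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
})

# The condition is a boolean Series with one value per row
mask = df_demo["GPA"] > 3.5
print(mask.tolist())  # [False, True, False]

# Only the rows where the mask is True are kept
filtered = df_demo[mask]
print(filtered)
```

Note that Alice is excluded because her GPA is exactly 3.5, and the comparison is strictly greater than.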
We can also filter on multiple conditions. For instance, let's say we want to filter the DataFrame to only the rows that have a GPA greater than 3.5 (our last condition) **and** more than 50 credits.

To accomplish this, you simply have to perform the following steps: 

**Step 1**: Define your 2 conditions
```python
gpa_condition = df['GPA'] > 3.5
credits_condition = df['Credits'] > 50
```
**Step 2**: Insert your conditions in square brackets immediately after your DataFrame object. Make sure to separate your conditions with `&`.
```python
# Displaying rows with GPA > 3.5 and credits > 50 
display(df[gpa_condition & credits_condition])
```
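The conditions can also be written inline. In that case each comparison needs its own parentheses, because `&` binds more tightly than `>`. The sketch below also shows `|` (or) and `~` (not). The name `df_demo` is a stand-in so the block runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
    "Credits": [45, 60, 30],
})

# AND: both conditions must hold
both = df_demo[(df_demo["GPA"] > 3.5) & (df_demo["Credits"] > 50)]

# OR: at least one condition must hold
either = df_demo[(df_demo["GPA"] > 3.5) | (df_demo["Credits"] < 40)]

# NOT: invert a condition
not_high_gpa = df_demo[~(df_demo["GPA"] > 3.5)]

print(both["Student"].tolist())          # ['Ben']
print(either["Student"].tolist())        # ['Ben', 'Carlos']
print(not_high_gpa["Student"].tolist())  # ['Alice', 'Carlos']
```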

Side note: logical operators in Python will be discussed later in these notes. They're pretty easy to learn and you can do some very cool things with them!

### **Adding and Modifying Columns**

To add a new column to a DataFrame, you simply put the new column name in square brackets `[]` on the DataFrame and assign values to it.

```python
# Add a new column
df["Graduated"] = [False, False, False]

# Modify an existing column
df["GPA"] = df["GPA"] * 1.05
```
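New columns are often computed from existing ones rather than typed in by hand. The sketch below derives two columns; `REQUIRED_CREDITS` and the 3.5 honors cutoff are hypothetical values chosen for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
    "Credits": [45, 60, 30],
})

REQUIRED_CREDITS = 120  # hypothetical graduation requirement

# Arithmetic on a column is applied to every row
df_demo["Credits_Remaining"] = REQUIRED_CREDITS - df_demo["Credits"]

# A comparison produces a True/False column
df_demo["Honors"] = df_demo["GPA"] >= 3.5

print(df_demo)
```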


### **Sorting Data**

```python
# Sort by GPA descending
print(df.sort_values(by="GPA", ascending=False))
```
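One detail worth knowing is that `sort_values()` returns a new, sorted DataFrame and leaves the original untouched unless you assign the result back. A minimal sketch, with `df_demo` as a stand-in name:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
})

# The sorted result is a new DataFrame
df_sorted = df_demo.sort_values(by="GPA", ascending=False)

print(df_sorted["Student"].tolist())  # ['Ben', 'Alice', 'Carlos']
print(df_demo["Student"].tolist())    # ['Alice', 'Ben', 'Carlos']

# The old index labels travel with the rows; reset them if you want 0, 1, 2
df_sorted = df_sorted.reset_index(drop=True)
print(df_sorted.index.tolist())       # [0, 1, 2]
```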


### **Grouping and Aggregating**

```python
# Example with grouping (pretend we have majors)
df["Major"] = ["Math", "Math", "History"]
print(df.groupby("Major")["GPA"].mean())
```
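If you need more than one statistic per group, `.agg()` accepts a list of aggregation names. The sketch below reuses the pretend majors from above and is only one of several ways to do this.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Alice", "Ben", "Carlos"],
    "GPA": [3.5, 3.8, 3.2],
    "Major": ["Math", "Math", "History"],
})

# One row per major, one column per aggregation
summary = df_demo.groupby("Major")["GPA"].agg(["mean", "count", "max"])
print(summary)
```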

### **Saving and Loading Data**

```python
# Save to CSV
df.to_csv("students.csv", index=False)
# Note: index=False leaves the index (the row-number column) out of the export.

# Load from CSV
df2 = pd.read_csv("students.csv")
print(df2)

# Save to Excel (requires an Excel engine such as openpyxl to be installed)
df.to_excel("students.xlsx", index=False)

# Load from Excel
df2 = pd.read_excel("students.xlsx")
print(df2)
```
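To see what `index=False` changes, the sketch below writes the same DataFrame twice and reads each copy back. It uses `io.StringIO` as an in-memory stand-in for a file so that nothing is written to disk; that substitution is mine, not part of the workflow above.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"Student": ["Alice", "Ben"], "GPA": [3.5, 3.8]})

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Student', 'GPA']
print(cols_without_index)  # ['Student', 'GPA']
```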

### **Finding the Size (Number of Columns / Rows) of a df**

```python
# Shape of the DataFrame (rows, columns)
display(df.shape)      # e.g. (3, 3)

# Number of rows
display(len(df))       # e.g. 3
display(df.shape[0])   # also gives 3

# Number of columns
display(len(df.columns))  # e.g. 3
display(df.shape[1])      # also gives 3
```

## **1.3: Python Functions for Data Manipulation**

Python provides built-in functions and external libraries (like NumPy) to help with data 
manipulation and basic statistical analysis. Below are some of the most common tools 
you'll use.

### Built-in Functions

#### Summation and Length

```python
numbers = [2, 4, 6, 8, 10]

print(sum(numbers))  # 30
print(len(numbers))  # 5
```

#### Minimum and Maximum

```python
print(min(numbers))  # 2
print(max(numbers))  # 10
```

#### Average (using built-in)

```python
average = sum(numbers) / len(numbers)
print(average)  # 6.0
```

### Using the `statistics` Module

Python’s built-in `statistics` library provides more descriptive statistics.

```python
import statistics as stats

data = [1, 2, 2, 3, 4, 7, 9]

print(stats.mean(data))      # Average
print(stats.median(data))    # Middle value
print(stats.mode(data))      # Most common value
print(stats.stdev(data))     # Sample standard deviation (divides by n - 1)
```


### Using NumPy

NumPy is a powerful library for numerical computing. It is often used for 
summary statistics and array manipulations.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(np.mean(arr))   # Average
print(np.median(arr)) # Median
print(np.std(arr))    # Standard deviation (population by default, ddof=0)
print(np.var(arr))    # Variance (population by default, ddof=0)
```
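`statistics.stdev()` and `np.std()` do not return the same number for the same data. The first computes the sample standard deviation (dividing by n - 1), while NumPy defaults to the population version (dividing by n). The sketch below shows how to make them agree.

```python
import statistics as stats
import numpy as np

values = [1, 2, 2, 3, 4, 7, 9]

sample_sd = stats.stdev(values)         # divides by n - 1
population_sd = stats.pstdev(values)    # divides by n

numpy_default = np.std(values)          # ddof=0, matches pstdev
numpy_sample = np.std(values, ddof=1)   # ddof=1, matches stdev

print(sample_sd, numpy_sample)
print(population_sd, numpy_default)
```

Which version is appropriate depends on whether your data is a sample or the full population.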

### Five Number Summary (NumPy)

The five number summary includes: **minimum, Q1, median, Q3, maximum**.

```python
data = np.array([7, 8, 5, 6, 3, 4, 9, 2, 1])

five_num = {
    "min": np.min(data),
    "Q1": np.percentile(data, 25),
    "median": np.median(data),
    "Q3": np.percentile(data, 75),
    "max": np.max(data)
}

print(five_num)
```
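A common follow-up to the five number summary is the interquartile range (IQR) and the 1.5 × IQR rule of thumb for flagging outliers. The sketch below uses the same numbers plus one hypothetical extreme value (30) so that there is something to flag; the rule is a convention, not a strict definition of an outlier.

```python
import numpy as np

# The sample from above plus one hypothetical extreme value (30)
data = np.array([7, 8, 5, 6, 3, 4, 9, 2, 1, 30])

q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(iqr, lower_fence, upper_fence)
print(outliers)
```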

### Rounding and Aggregation

```python
values = [3.14159, 2.71828, 1.61803]

# Round to 2 decimal places
rounded = [round(v, 2) for v in values]
print(rounded)  # [3.14, 2.72, 1.62]
```
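Two behaviors of `round()` can be surprising. Ties are rounded to the nearest even number, and some decimals cannot be stored exactly as floats. The sketch below shows both, along with the `decimal` module as one possible alternative when you need round-half-up.

```python
from decimal import Decimal, ROUND_HALF_UP

# Ties go to the nearest even integer
print(round(0.5))  # 0
print(round(1.5))  # 2
print(round(2.5))  # 2

# 2.675 is stored as a float slightly below 2.675, so it rounds down
print(round(2.675, 2))  # 2.67

# Decimal works on the exact decimal digits and lets you pick the rounding mode
half_up = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(half_up)  # 2.68
```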

### Group By (Frequency Tables)

The `groupby()` function in pandas is a powerful way to create frequency tables and perform aggregations. It allows you to split a dataset into groups based on a categorical variable (e.g., Gender, Eye Color), and then apply a calculation (such as mean, count, or sum) within each group.

This is especially useful when you want to compute summary statistics (e.g., average age by gender) or generate frequency distributions of categorical data.

```python
import pandas as pd

# Create a small dataset with Age, Gender, and Eye Color
data = {
    'Age': [22, 27, 30],
    'Gender': ['m', 'f', 'f'],
    'Eye Color': ['blue', 'green', 'blue']
}

# Convert the dictionary into a pandas DataFrame
demographics = pd.DataFrame(data)

# Display the full demographics table
display("Demographics Table:")
display(demographics)

# Group the data by Gender and calculate the average Age for each group
avg_age_by_gender = demographics.groupby('Gender')['Age'].mean()

# Display the aggregated table (average Age by Gender)
display("Average Age by Gender Table:")
display(avg_age_by_gender)

# The average age for females in our sample data is higher
# than the average age of males.
```

Output:

```
'Demographics Table:'

   Age Gender Eye Color
0   22      m      blue
1   27      f     green
2   30      f      blue

'Average Age by Gender Table:'

Gender
f    28.5
m    22.0
Name: Age, dtype: float64
```
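The example above computes a mean per group. For a plain frequency table (how many rows fall in each category), `value_counts()` or `groupby(...).size()` can be used. A minimal sketch on the same demographics data:

```python
import pandas as pd

demographics = pd.DataFrame({
    "Age": [22, 27, 30],
    "Gender": ["m", "f", "f"],
    "Eye Color": ["blue", "green", "blue"],
})

# Frequency table for a single column
eye_color_counts = demographics["Eye Color"].value_counts()
print(eye_color_counts)

# The same idea expressed with groupby
gender_sizes = demographics.groupby("Gender").size()
print(gender_sizes)
```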

### Summary

- Built-in Python functions: `sum()`, `len()`, `min()`, `max()`, `round()`  
- `statistics` module: `mean()`, `median()`, `mode()`, `stdev()`  
- NumPy: `mean()`, `median()`, `std()`, `var()`, `percentile()`  
- Use these tools to quickly compute descriptive statistics and manipulate data arrays.

## **1.4: Python Data Types**

Python has several built-in data types that are commonly used for data analysis and programming. 
Below is a summary of the most important ones, along with examples.

### Numeric Types

#### Integers (`int`)

Whole numbers, positive or negative, without a decimal point.

```python
x = 10
y = -3
print(type(x))  # <class 'int'>
```

#### Floating-Point Numbers (`float`)

Numbers that contain a decimal point.

```python
pi = 3.14159
temperature = -5.6
print(type(pi))  # <class 'float'>
```
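Floats are stored in binary, so some decimal values are only approximated. This matters when comparing results in data analysis. The sketch below shows the classic case and one way to compare floats safely using `math.isclose()`.

```python
import math

total = 0.1 + 0.2

# Direct equality fails because of a tiny representation error
print(total == 0.3)              # False

# Compare within a tolerance instead
print(math.isclose(total, 0.3))  # True
```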

#### Complex Numbers (`complex`)

Numbers with a real and imaginary part.

```python
z = 2 + 3j
print(type(z))  # <class 'complex'>
```

### Text Types

#### Strings (`str`)

Sequences of characters enclosed in quotes.

```python
name = "Alice"
greeting = 'Hello, World!'
print(type(name))  # <class 'str'>
```

### Sequence Types

#### Lists (`list`)

Ordered, mutable collections of items.

```python
fruits = ["apple", "banana", "cherry"]
fruits.append("date")
print(fruits)
```

#### Tuples (`tuple`)

Ordered, immutable collections of items.

```python
coordinates = (4, 5)
print(coordinates[0])  # 4
```

#### Ranges (`range`)

Represents a sequence of numbers, commonly used in loops.

```python
for i in range(3):
    print(i)  # 0, 1, 2
```

### Mapping Types

#### Dictionaries (`dict`)

Collections of key-value pairs.

```python
student = {"name": "Alice", "age": 20, "GPA": 3.8}
print(student["name"])  # Alice
```
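A few everyday dictionary operations are sketched below: adding a key, reading a key that may be missing, and looping over the pairs. The `"major"` and `"credits"` keys are hypothetical additions for illustration.

```python
student = {"name": "Alice", "age": 20, "GPA": 3.8}

# Add (or overwrite) a key
student["major"] = "Math"

# .get() returns a default instead of raising KeyError when the key is missing
credits = student.get("credits", 0)
print(credits)  # 0

# Loop over key-value pairs (dictionaries keep insertion order)
for key, value in student.items():
    print(key, value)
```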


### Boolean Type

#### Booleans (`bool`)
Represents `True` or `False` values.

```python
is_student = True
has_graduated = False
print(type(is_student))  # <class 'bool'>
```

### None Type

#### None (`NoneType`)
Represents the absence of a value.

```python
result = None
print(type(result))  # <class 'NoneType'>
```
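### Set Types

#### Sets (`set`, `frozenset`)

Unordered collections of unique items. The summary at the end of this section lists the set types, so here is a minimal sketch of them, with sample values chosen only for illustration.

```python
# Duplicates are dropped automatically
majors = {"Math", "History", "Math"}
print(len(majors))  # 2

# Sets are mutable
majors.add("Physics")
print("Physics" in majors)  # True

# frozenset is the immutable counterpart of set
core_subjects = frozenset(["Math", "History"])

# Intersection: items present in both
shared = majors & core_subjects
print(sorted(shared))  # ['History', 'Math']
```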


### Summary

- **Numeric types**: `int`, `float`, `complex`
- **Text type**: `str`
- **Sequence types**: `list`, `tuple`, `range`
- **Mapping type**: `dict`
- **Set types**: `set`, `frozenset`
- **Boolean type**: `bool`
- **Special type**: `NoneType`











Understanding these basic data types is essential before working with libraries like pandas and NumPy.
