### **1. Data Cleaning in Python: Handling Missing Values, Duplicates, and Renaming Columns**

**Handling Missing Values**  
Missing values can be handled in several ways using `pandas`. The common methods are:

- **Identifying missing values**:  
  - `isnull()`: Returns a boolean DataFrame where `True` indicates a missing value.
    ```python
    df.isnull()
    ```
  - `isnull().sum()`: Chaining `sum()` onto the boolean result counts the missing values in each column.
    ```python
    df.isnull().sum()
    ```

- **Removing missing values**:  
  - `dropna()`: Removes rows with any missing values by default.
    ```python
    df.dropna()  # Removes rows with any NaN values
    df.dropna(axis=1)  # Removes columns with any NaN values
    ```

- **Filling missing values**:  
  - `fillna()`: Replaces missing values with a specified value. Forward and backward fill have their own methods, `ffill()` and `bfill()`; passing `method=` to `fillna()` is deprecated as of pandas 2.1.
    ```python
    df.fillna(0)  # Replace NaN with 0
    df.ffill()  # Forward fill: propagates the previous valid value
    df.bfill()  # Backward fill: propagates the next valid value
    ```
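
The calls above can be tried together on a small frame. This is a minimal sketch; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "temp": [20.5, np.nan, 22.1, np.nan],
    "city": ["Oslo", "Rome", None, "Lima"],
})

missing_per_column = df.isnull().sum()  # number of missing values per column
rows_complete = df.dropna()             # only rows with no missing values
filled_zero = df.fillna({"temp": 0})    # fill a single column with a constant
filled_forward = df.ffill()             # carry the last valid value forward

print(missing_per_column)
print(filled_forward)
```

Each call returns a new DataFrame and leaves `df` unchanged, so the results can be compared side by side before one is kept.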

**Handling Duplicates**  
To handle duplicate rows:

- **Identifying duplicates**:
  ```python
  df.duplicated()  # Returns a boolean Series that is True for rows repeating an earlier row
  df[df.duplicated()]  # Shows only the duplicate rows
  ```

- **Removing duplicates**:
  - `drop_duplicates()`: Removes duplicate rows, keeping the first or last occurrence.
    ```python
    df.drop_duplicates()  # Removes duplicate rows
    df.drop_duplicates(keep='last')  # Keeps the last occurrence of duplicates
    ```
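
Duplicates are often defined by a key column rather than by the whole row. The sketch below, using hypothetical `customer` and `amount` columns, contrasts whole-row deduplication with the `subset` parameter.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact repeat of row 0
df = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "ann"],
    "amount": [10, 20, 10, 30],
})

full_dupes = df.duplicated()    # compares entire rows
deduped = df.drop_duplicates()  # drops only the exact repeat

# Keep one row per customer, preferring the last occurrence
latest_per_customer = df.drop_duplicates(subset="customer", keep="last")

print(latest_per_customer)
```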

**Renaming Columns**  
To rename columns, use `rename()`:

- **Renaming specific columns**:
  ```python
  df.rename(columns={'old_name': 'new_name'}, inplace=True)
  ```

- **Renaming all columns**:
  ```python
  df.columns = ['new_col1', 'new_col2', 'new_col3']  # List length must match the number of columns
  ```
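
`rename()` also accepts a function, which helps when every label needs the same cleanup. A small sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frame with inconsistent column labels
df = pd.DataFrame({"First Name": ["Ann"], "Last Name": ["Lee"], "AGE": [31]})

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={"AGE": "age"})

# Apply one rule to every label: lower-case, spaces to underscores
normalized = df.rename(columns=lambda c: c.lower().replace(" ", "_"))

print(list(normalized.columns))
```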

---

### **2. Data Manipulation in Python: GroupBy, Merging, Joining, and Concatenation**

**GroupBy**  
`groupby()` splits the data into groups, applies an operation to each group, and combines the results.

- **Basic Grouping**:
  ```python
  df.groupby('column_name').sum()  # Sum of values within each group
  df.groupby('column_name').mean(numeric_only=True)  # Mean of numeric columns within each group
  ```

- **Multiple aggregation functions**:
  ```python
  df.groupby('column_name').agg(['sum', 'mean'])
  ```
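
A worked example makes the shape of the result easier to see. The `region`, `sales`, and `units` columns below are illustrative only.

```python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [100, 80, 50, 20],
    "units": [10, 8, 5, 4],
})

# One aggregated value per group
totals = df.groupby("region")["sales"].sum()

# Several functions at once: the result has two-level column labels,
# such as ("sales", "sum") and ("sales", "mean")
summary = df.groupby("region")[["sales", "units"]].agg(["sum", "mean"])

print(totals)
print(summary)
```

Selecting the numeric columns before aggregating avoids errors from text columns that have no meaningful mean.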

**Merging**  
Merging combines two DataFrames based on a common column (like SQL joins).

- **Basic Merge**:
  ```python
  pd.merge(df1, df2, on='common_column', how='inner')  # Performs inner join
  pd.merge(df1, df2, on='common_column', how='left')   # Left join
  pd.merge(df1, df2, on='common_column', how='right')  # Right join
  pd.merge(df1, df2, on='common_column', how='outer')  # Outer join
  ```
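
To see how the `how` options differ, it helps to merge two small frames whose keys only partly overlap. This sketch uses hypothetical `orders` and `customers` tables; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical tables: customer 4 has no profile, customer 3 has no order
orders = pd.DataFrame({"customer_id": [1, 2, 4], "total": [50, 20, 70]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})

# Inner join: only keys present in both tables
inner = pd.merge(orders, customers, on="customer_id", how="inner")

# Outer join: all keys, with the source of each row recorded in "_merge"
outer = pd.merge(orders, customers, on="customer_id", how="outer", indicator=True)

print(inner)
print(outer)
```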

**Joining**  
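`join()` combines DataFrames on their index by default. With `on=`, it matches a column of the calling DataFrame against the index of the other DataFrame.

- **Basic Join**:
  ```python
  df1.join(df2, on='column_name')  # Match df1['column_name'] against df2's index
  ```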
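
The two forms of `join()` can be sketched as follows. The `prices`, `stock`, and `sales` frames are hypothetical.

```python
import pandas as pd

# Hypothetical lookup tables indexed by fruit name
prices = pd.DataFrame({"price": [1.5, 2.0]}, index=["apple", "pear"])
stock = pd.DataFrame({"qty": [10, 7]}, index=["apple", "kiwi"])

# Index-to-index join (left join by default): "pear" has no stock row
by_index = prices.join(stock, how="left")

# Column-to-index join: look up each sale's fruit in the prices index
sales = pd.DataFrame({"fruit": ["pear", "apple", "pear"], "sold": [3, 1, 2]})
with_price = sales.join(prices, on="fruit")

print(by_index)
print(with_price)
```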

**Concatenation**  
Concatenation stacks DataFrames either row-wise (vertically) or column-wise (horizontally).

- **Concatenating vertically**:
  ```python
  pd.concat([df1, df2], axis=0)  # Stacks df1 and df2 vertically (default axis=0)
  ```

- **Concatenating horizontally**:
  ```python
  pd.concat([df1, df2], axis=1)  # Places df1 and df2 side by side, aligned on the index (axis=1)
  ```
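
Two details tend to matter in practice: the row index after vertical stacking, and column name clashes after horizontal stacking. A minimal sketch with made-up frames:

```python
import pandas as pd

# Hypothetical frames with the same columns and default indexes
q1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
q2 = pd.DataFrame({"id": [3, 4], "value": [30, 40]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating 0, 1, 0, 1
stacked = pd.concat([q1, q2], ignore_index=True)

# Suffix the second frame's columns so the names stay unique
side_by_side = pd.concat([q1, q2.add_suffix("_q2")], axis=1)

print(stacked)
print(side_by_side)
```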

---

### **3. Data Aggregation in Python: Sum, Mean, Count, etc.**

**Aggregation** allows you to summarize data in various ways:

- **Sum**:
  ```python
  df['column_name'].sum()  # Sum of values in a column
  ```

- **Mean**:
  ```python
  df['column_name'].mean()  # Mean of values in a column
  ```

- **Count**:
  ```python
  df['column_name'].count()  # Count non-null values in a column
  ```

- **Min/Max**:
  ```python
  df['column_name'].min()  # Minimum value in a column
  df['column_name'].max()  # Maximum value in a column
  ```

- **Multiple aggregations**:
  ```python
  df.groupby('column_name').agg({'column1': 'sum', 'column2': 'mean'})
  ```

- **Custom aggregation functions**:
  ```python
  df.groupby('column_name')['column1'].agg(lambda x: x.max() - x.min())  # Range of column1 within each group
  ```
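
Built-in and custom aggregations can be combined in one call with named aggregation, which also gives the output columns readable names. The `team` and `score` columns here are assumptions for the example.

```python
import pandas as pd

# Hypothetical scores per team
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [3, 9, 4, 6],
})

# Each keyword becomes an output column: name=(input column, function)
stats = df.groupby("team").agg(
    total=("score", "sum"),
    average=("score", "mean"),
    spread=("score", lambda s: s.max() - s.min()),
)

print(stats)
```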

---

### **4. Reading and Writing Data in Python: CSV, Excel, JSON**

**Reading Data**  
`pandas` provides a reader function for each common file format:

- **Reading CSV**:
  ```python
  import pandas as pd
  df = pd.read_csv('file.csv')
  ```

- **Reading Excel**:
  ```python
  df = pd.read_excel('file.xlsx', sheet_name='Sheet1')  # Read a specific sheet (.xlsx files need the openpyxl package)
  ```

- **Reading JSON**:
  ```python
  df = pd.read_json('file.json')
  ```

**Writing Data**  
You can also write DataFrames to various file formats:

- **Writing to CSV**:
  ```python
  df.to_csv('output.csv', index=False)  # Write DataFrame to CSV without row index
  ```

- **Writing to Excel**:
  ```python
  df.to_excel('output.xlsx', index=False)  # Write DataFrame to Excel
  ```

- **Writing to JSON**:
  ```python
  df.to_json('output.json')  # Write DataFrame to JSON
  ```
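
A quick way to check that a file format preserves the data is to write a frame and read it back. The sketch below does this for CSV and JSON in a temporary directory; the file names and columns are placeholders, and Excel is left out because it needs an extra package.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data to round-trip
df = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "people.csv"
    json_path = Path(tmp) / "people.json"

    df.to_csv(csv_path, index=False)         # no index column in the file
    df.to_json(json_path, orient="records")  # a list of one object per row

    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)

print(from_csv)
print(from_json)
```

Writing with `index=False` matters for the round trip: otherwise the CSV gains an extra unnamed column when it is read back.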

---

### **5. Handling Time Series Data in Python**

Time series data requires specific handling, including converting data to datetime objects and resampling.

**Converting to DateTime**  
To work with time series data, ensure the date column is in datetime format:

```python
df['date_column'] = pd.to_datetime(df['date_column'])
```

**Setting Date as Index**  
For time series analysis, it's common to set the date as the index:

```python
df.set_index('date_column', inplace=True)
```
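
Real date columns often contain a few unparseable entries. One option, sketched here with made-up values, is to pass `errors="coerce"` so that bad entries become `NaT`, then drop them before setting the index.

```python
import pandas as pd

# Hypothetical raw data with one bad date string
df = pd.DataFrame({
    "date_column": ["2025-01-05", "2025-02-10", "not a date"],
    "value": [1, 2, 3],
})

# Unparseable strings become NaT instead of raising an error
df["date_column"] = pd.to_datetime(
    df["date_column"], format="%Y-%m-%d", errors="coerce"
)

# Drop rows without a valid date, then index by date
clean = df.dropna(subset=["date_column"]).set_index("date_column")

print(clean)
```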

**Resampling Time Series**  
Resampling allows you to change the frequency of the data (e.g., daily to monthly):

- **Resample to monthly frequency**:
  ```python
  df.resample('ME').sum()  # Sum of values for each month ('ME' = month end; pandas before 2.2 uses 'M')
  ```

- **Resample with other aggregations**:
  ```python
  df.resample('D').mean()  # Daily mean values
  ```
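
The following sketch resamples four daily values that straddle a month boundary. It uses the month-start alias `'MS'`, which labels each bin by the first day of the month and is accepted by both older and newer pandas versions; the data is invented for the example.

```python
import pandas as pd

# Hypothetical daily data: Jan 30, Jan 31, Feb 1, Feb 2
idx = pd.date_range("2025-01-30", periods=4, freq="D")
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# One row per calendar month, labelled by the month's first day
monthly_sum = df.resample("MS").sum()

# Two-day bins, averaged
two_day_mean = df.resample("2D").mean()

print(monthly_sum)
print(two_day_mean)
```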

**Rolling Windows**  
To smooth time series data, you can use rolling windows:

- **Moving Average**:
  ```python
  df['rolling_mean'] = df['value_column'].rolling(window=7).mean()  # 7-row moving average (7 days if the data is daily)
  ```
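
A rolling mean is undefined until the window is full, so the first `window - 1` results are `NaN` by default. The sketch below shows this on a short made-up series and how `min_periods` relaxes it.

```python
import pandas as pd

# Hypothetical evenly spaced observations
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result only once 3 values are available
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever is available so far
lenient = s.rolling(window=3, min_periods=1).mean()

print(strict)
print(lenient)
```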

**Time-Based Indexing**  
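When the index is a `DatetimeIndex`, rows can be sliced with date strings. Using `.loc` makes it explicit that the slice selects rows:

```python
df.loc['2025-01-01':'2025-02-01']  # Rows from Jan 1, 2025 through Feb 1, 2025 (both endpoints included)
```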
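
Date strings can also be partial, which selects a whole month or year at once. A small sketch on generated daily data (the frame itself is hypothetical):

```python
import pandas as pd

# Hypothetical daily data from Dec 30, 2024 to Feb 7, 2025
idx = pd.date_range("2024-12-30", periods=40, freq="D")
df = pd.DataFrame({"value": range(40)}, index=idx)

# Explicit range: both endpoints are included
window = df.loc["2025-01-01":"2025-02-01"]

# Partial string: every row in January 2025
january = df.loc["2025-01"]

print(len(window), len(january))
```

Unlike positional slicing, label-based slicing includes the end label, so the range above contains February 1.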

---
