## Assignment 1 - Data Wrangling I

#### **Objective**:
This assignment aims to practice basic data wrangling techniques with **pandas**, focusing on cleaning and preprocessing a dataset. Key tasks include handling missing values, converting data types, and transforming categorical variables into numeric ones.

---

#### **Steps to Accomplish**:

1. **Load Dataset into a pandas DataFrame**:
   - First, the dataset is loaded into a **pandas DataFrame** from a file (e.g., CSV or Excel). This allows us to easily manipulate and explore the data.

2. **Inspect the First Few Rows**:
   - Use the `head()` function to display the first few rows of the dataset, giving you a quick look at the structure of the data.

3. **Check the Dataset's Information**:
   - The `info()` function provides a concise summary of the DataFrame, including the number of non-null values, data types, and column names.

4. **Convert Data Types to Numeric**:
   - If any columns should be numeric but are stored as objects (e.g., strings), we convert them using pandas' `pd.to_numeric()` function.

5. **Describe the Dataset**:
   - The `describe()` function provides summary statistics for numeric columns, helping identify patterns, trends, or anomalies in the data.

6. **Check for Missing Values**:
   - To identify missing values, use `isnull().sum()`, which counts the missing entries in each column. On its own, `isnull()` only returns a boolean mask of the same shape as the DataFrame.

7. **Handle Missing Values**:
   - For missing data, you can either fill it with the **mean** of the column or use **forward fill** (`ffill`) to propagate previous values forward.

8. **Rename Columns**:
   - It's a good practice to rename columns to more meaningful names if they are unclear, using the `rename()` function.

9. **Check Dataset Dimensions**:
   - Use the `shape` attribute to check the dimensions of the dataset (i.e., the number of rows and columns).

10. **Convert Categorical Variables to Numeric**:
    - **LabelEncoder** from **sklearn.preprocessing** converts categorical variables into numeric values. Most machine learning models require numerical input, so this step usually comes before modeling.
    - Example: If a column contains the categories 'Male' and 'Female', **LabelEncoder** sorts the classes alphabetically and assigns 'Female' → 0 and 'Male' → 1.
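
Steps 1 to 9 above can be sketched end to end on a tiny in-memory dataset. The CSV contents and the column names (`Name`, `Gender`, `Marks`, `Age`, `exam_marks`) are hypothetical stand-ins; a real run would read the assignment's own file with `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV contents, used here instead of a real file on disk.
raw_csv = io.StringIO(
    "Name,Gender,Marks,Age\n"
    "Asha,Female,78,20\n"
    "Ravi,Male,abc,21\n"
    "Meena,Female,91,\n"
    "Karan,Male,64,22\n"
)

df = pd.read_csv(raw_csv)   # Step 1: load into a DataFrame
print(df.head())            # Step 2: first few rows
df.info()                   # Step 3: dtypes and non-null counts

# Step 4: "Marks" is read as object because of the stray "abc".
# errors="coerce" turns unparseable entries into NaN.
df["Marks"] = pd.to_numeric(df["Marks"], errors="coerce")

print(df.describe())                # Step 5: summary statistics
missing_before = df.isnull().sum()  # Step 6: missing count per column
print(missing_before)

# Step 7: fill with the column mean, or propagate the previous row forward.
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())
df["Age"] = df["Age"].ffill()

df = df.rename(columns={"Marks": "exam_marks"})  # Step 8: clearer name
print(df.shape)                                  # Step 9: (rows, columns)
```

Coercing before filling matters here: the mean can only be computed once the column is numeric.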

---

### **Theory and Concepts Covered**:

- **Data Wrangling**: The process of cleaning and transforming raw data into a usable format for analysis. This includes handling missing values, renaming columns, and changing data types.

- **Handling Missing Data**: Missing values are a common problem in real-world data. Strategies to handle missing data include replacing missing values with the column's mean (for numerical data) or using techniques like **forward filling**.

- **Converting Data Types**: Data often comes in the wrong type (e.g., text instead of numbers). Converting the data to the correct type is crucial for analysis and modeling. Numeric columns are particularly important in machine learning.

- **Categorical to Numeric Conversion**: Machine learning algorithms generally require numerical input. **LabelEncoder** is a simple tool for converting categorical data (e.g., 'Yes', 'No') into numeric values (e.g., 1, 0). This helps the model process categorical information.

---

### **Potential Questions and Answers**:

**Q1: Why do we need to convert categorical variables into numeric values?**  
**A1:** Many machine learning models require numerical input. Categorical variables, like 'Male' and 'Female' or 'Red' and 'Blue', are typically encoded into numeric values to allow the algorithm to interpret the data.

**Q2: What are the different methods to handle missing values?**  
**A2:** Common methods for handling missing values include:
   - **Forward fill (ffill)**: Fills missing values with the previous non-null value.
   - **Backward fill (bfill)**: Fills missing values with the next non-null value.
   - **Filling with mean/median**: Replaces missing values with the mean or median of the column.
   - **Dropping rows/columns**: If the missing data is too significant, rows or columns with missing values can be removed.
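
The four strategies listed in A2 can be compared side by side on one small series. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

forward = s.ffill()                   # each gap takes the previous value
backward = s.bfill()                  # each gap takes the next value
mean_filled = s.fillna(s.mean())      # mean of the non-missing values
median_filled = s.fillna(s.median())  # median of the non-missing values
dropped = s.dropna()                  # rows with gaps are removed

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "mean": mean_filled,
    "median": median_filled,
}))
```

Forward and backward fill assume that row order is meaningful (for example, a time series); mean or median filling does not.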

**Q3: What is the difference between `LabelEncoder` and One-Hot Encoding?**  
**A3:**  
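   - **LabelEncoder** assigns each category a unique integer. Integer codes suit ordinal data (data with a natural order, like 'Low', 'Medium', 'High'), but **LabelEncoder** numbers the classes in sorted order ('High' → 0, 'Low' → 1, 'Medium' → 2), so an explicit mapping or `OrdinalEncoder` with a given category order is needed to preserve the real ranking.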
   - **One-Hot Encoding** creates a new binary column for each category. It is typically used for nominal data (data without a natural order, like 'Red', 'Blue', 'Green').
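
A minimal sketch of the difference, assuming scikit-learn is installed. The `size` and `color` columns and the `order` mapping are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "size" is ordinal, "color" is nominal.
df = pd.DataFrame({
    "size": ["Low", "High", "Medium", "Low"],
    "color": ["Red", "Blue", "Green", "Red"],
})

# LabelEncoder numbers the classes in sorted order, not in ranking order.
le = LabelEncoder()
df["size_label"] = le.fit_transform(df["size"])
print(list(le.classes_))

# An explicit mapping keeps the intended Low < Medium < High order.
order = {"Low": 0, "Medium": 1, "High": 2}
df["size_ordinal"] = df["size"].map(order)

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```

Each row of `one_hot` has exactly one active column, which is why it does not suggest that one color is "greater" than another.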



## Assignment 2 - Data Wrangling II

#### **Objective**:

To clean and prepare the dataset by detecting inconsistencies, handling outliers, and applying normalization techniques for better model performance.

---

#### **Steps**:

1. **Scan Variables for Inconsistencies**:

   - Use `unique()` to identify unexpected values or errors in each column, especially for categorical data.

2. **Remove Outliers**:

   - Detect outliers using methods like **IQR (Interquartile Range)** or **Z-score** (values beyond ±3).
   - Optionally, remove or cap outliers to prevent them from affecting the analysis.

3. **Min-Max Normalization**:

   - Scale numerical data to the range [0, 1] using the formula:
     $$
     X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
     $$
   - Useful for models sensitive to feature scale, such as k-nearest neighbors and neural networks.

4. **Z-score Normalization**:

   - Standardize the data by subtracting the mean and dividing by the standard deviation:
     $$
     Z = \frac{X - \mu}{\sigma}
     $$
   - Works best when the data is approximately Gaussian. The result has mean 0 and standard deviation 1 but no fixed range.
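
Both formulas translate directly into pandas arithmetic. This is a minimal sketch on made-up values; it assumes the population standard deviation (`ddof=0`) for σ, whereas pandas' `std()` defaults to the sample version (`ddof=1`).

```python
import pandas as pd

# Hypothetical numeric column.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max normalization: (X - X_min) / (X_max - X_min)
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: (X - mu) / sigma, with population sigma (ddof=0).
z_score = (values - values.mean()) / values.std(ddof=0)

print(min_max.tolist())
print(z_score.round(3).tolist())
```

After scaling, `min_max` spans exactly 0 to 1, while `z_score` is centered on 0 with unit spread.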

---

### **Key Concepts**:

- **Inconsistencies**: Check for errors or unexpected values in data.
- **Outliers**: Extreme values that can distort analysis, detected and handled through statistical methods.
- **Normalization**: Min-Max and Z-score techniques adjust data scale for better model performance.

---

### **Questions & Answers**:

**Q1: Min-Max vs Z-score Normalization?**\
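**A1**: Min-Max scales data to the fixed range [0, 1] but is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses everything else. Z-score rescales data to mean 0 and standard deviation 1 with no fixed range; it is less distorted by outliers than Min-Max, although the mean and standard deviation are still influenced by them.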

**Q2: How to handle outliers?**\
**A2**: Use IQR or Z-score to detect outliers, then remove or cap them based on your approach.
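
As an illustration of A2, the IQR rule can be sketched as follows. The data is hypothetical, and the 1.5 × IQR fences are the conventional choice rather than something the assignment prescribes.

```python
import pandas as pd

# Hypothetical column with one obvious extreme value.
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

removed = s[~is_outlier]                   # option 1: drop the outliers
capped = s.clip(lower=lower, upper=upper)  # option 2: cap them at the fences

print(s[is_outlier].tolist())
print(capped.max())
```

Removing shortens the data, while capping keeps every row and only limits how extreme a value can be.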

**Q3: Why check for inconsistencies?**\
**A3**: Inconsistencies can lead to errors in analysis and affect model accuracy, so they need to be fixed before further analysis.
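
A short sketch of spotting and repairing inconsistent category labels. The `gender` column, its values, and the abbreviation mapping are hypothetical.

```python
import pandas as pd

# Hypothetical column with mixed case, stray whitespace, and abbreviations.
df = pd.DataFrame({"gender": ["Male", "male ", "FEMALE", "Female", "M", "female"]})

# unique() exposes the variants that should be a single category.
print(df["gender"].unique())

# Normalize whitespace and case, then map known abbreviations.
cleaned = df["gender"].str.strip().str.lower()
cleaned = cleaned.replace({"m": "male", "f": "female"})
df["gender"] = cleaned

print(df["gender"].unique())
```

Without this step, a later `groupby` or encoder would treat 'Male', 'male ' and 'M' as three different categories.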

---


## Assignment 3 - Data Wrangling I: NBA & Iris Dataset

## **Part 1: NBA Dataset**

### Steps:

1. **Preprocessing & Removing Inconsistencies**:

   - Handle missing or incorrect values (e.g., negative heights).
   - Convert height to a consistent unit (e.g., inches to centimeters).

2. **Remove Outliers**:

   - Detect outliers using Z-score or IQR methods, then remove or cap them.

3. **Central Tendency**:

   - Calculate mean, median, and mode for numerical columns like height, age, and salary.

4. **Variance**:

   - Calculate variance and standard deviation to understand the spread of the data.

5. **Grouping by Age**:

   - Use `pd.cut()` to group players into age bins (e.g., 18-25, 26-30).
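
Steps 1, 3 and 4 above might look like this on a handful of rows. The `players` frame and its values are hypothetical, not taken from the NBA dataset.

```python
import pandas as pd

# Hypothetical player data; heights are in inches.
players = pd.DataFrame({
    "height_in": [75, 78, 80, 78, 83],
    "age": [22, 27, 31, 27, 35],
})

# Step 1: convert height to a consistent unit (1 inch = 2.54 cm).
players["height_cm"] = players["height_in"] * 2.54

# Step 3: central tendency.
mean_age = players["age"].mean()
median_age = players["age"].median()
mode_age = players["age"].mode()   # a Series, since there can be ties

# Step 4: spread. pandas uses the sample formulas (ddof=1) by default.
var_age = players["age"].var()
std_age = players["age"].std()

print(mean_age, median_age, mode_age.tolist(), var_age, std_age)
```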

### **One-line Example for pd.cut**:

```python
# Default right=True closes each bin on the right: (17, 25], (25, 30], (30, 35], (35, 40]
df['age_group'] = pd.cut(df['age'], bins=[17, 25, 30, 35, 40], labels=['18-25', '26-30', '31-35', '36-40'])
```
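
The `right` argument of `pd.cut` decides which edge of each bin is closed, which matters for ages that sit exactly on a boundary. The sketch below uses made-up ages to show both settings with the same labels.

```python
import pandas as pd

# Hypothetical ages, chosen to sit on bin boundaries.
ages = pd.Series([18, 25, 26, 30, 40])
labels = ["18-25", "26-30", "31-35", "36-40"]

# right=True (default): bins are (17, 25], (25, 30], (30, 35], (35, 40]
closed_right = pd.cut(ages, bins=[17, 25, 30, 35, 40], labels=labels)

# right=False: bins are [18, 25), [25, 30), [30, 35), [35, 40)
closed_left = pd.cut(ages, bins=[18, 25, 30, 35, 40], labels=labels, right=False)

print(pd.DataFrame({
    "age": ages,
    "right=True": closed_right,
    "right=False": closed_left,
}))
```

With `right=False`, age 25 lands in the bin labeled '26-30' and age 40 falls outside every bin, so the labels only match inclusive integer ranges when the bins are closed on the right.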

---

## **Part 2: Iris Dataset**

### Steps:

1. **Preprocessing & Removing Inconsistencies**:

   - Handle missing values and ensure correct data types.

2. **Remove Outliers**:

   - Use Z-score or IQR methods to detect and remove outliers in numerical columns.

3. **Central Tendency**:

   - Calculate mean, median, and mode for sepal and petal dimensions.

4. **Variance**:

   - Compute variance and standard deviation for numerical columns like sepal length and petal width.

5. **Group by Species**:

   - Group by species and analyze measurements (e.g., mean, median) for each species.

6. **Label & One-Hot Encoding**:

   - Use **LabelEncoder** to convert species to numerical labels.
   - Apply **OneHotEncoder** or `pd.get_dummies()` for categorical variables (e.g., species).
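
Step 5 (grouping by species) can be sketched with `groupby` and `agg`. The rows below are a hypothetical six-row stand-in for the Iris data, with assumed column names.

```python
import pandas as pd

# Hypothetical subset; the real Iris dataset has 150 rows.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.0, 6.4, 6.5, 7.1],
    "petal_width": [0.2, 0.4, 1.3, 1.5, 2.0, 2.2],
})

# One row per species, one (column, statistic) pair per output column.
summary = iris.groupby("species")[["sepal_length", "petal_width"]].agg(
    ["mean", "median", "std"]
)
print(summary)

# Individual values are addressed by species and (column, statistic).
setosa_mean = summary.loc["setosa", ("sepal_length", "mean")]
```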

### **One-line Example for One-Hot Encoding (`pd.get_dummies`)**:

```python
df_encoded = pd.get_dummies(df, columns=['species'])
```

### **One-line Example for LabelEncoder**:

```python
from sklearn.preprocessing import LabelEncoder
df['species_encoded'] = LabelEncoder().fit_transform(df['species'])
```

---


## Assignment 10

```python
# IMPORTS

from nltk import pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist


# Assignment steps

# Import & download
# Open and read
# Tokenize
# Stop word removal
# POS tagging
# Stemming
# Lemmatization

# TF
# IDF
```

## Assignment 13

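1. Start the Spark shell with `spark-shell`.
2. Paste the program into the shell.
3. In a separate terminal, start a netcat listener on port 9999 with `nc -lk 9999`.
4. Back in the Spark shell, run the program with `WordCount.main(Array())`.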