
# Exercise Documentation Part 1
Azzezza NF


## Coding Exercise 3: Encoding Categorical Data for Machine Learning


1. Import the required libraries (pandas and NumPy) and the classes needed for this task: ColumnTransformer, OneHotEncoder, and LabelEncoder.

2. Start by loading the Titanic dataset into a pandas data frame. This can be done using the pd.read_csv function. The dataset's name is 'titanic.csv'.

3. Identify the categorical features in your dataset that need to be encoded. You can store these feature names in a list for easy access later.

4. To apply OneHotEncoding to these categorical features, create an instance of the ColumnTransformer class. Make sure to pass the OneHotEncoder() as an argument along with the list of categorical features.

5. Use the fit_transform method on the instance of ColumnTransformer to apply the OneHotEncoding.

6. The output of the fit_transform method should be converted into a NumPy array for further use.

7. The 'Survived' column in your dataset is the dependent variable. This is a binary categorical variable that should be encoded using LabelEncoder.

8. Print the updated matrix of features and the dependent variable vector.


In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


# Load the dataset
df = pd.read_csv('titanic.csv')

# Identify the categorical data
categorical_features = ['Embarked', 'Pclass', 'Sex']

# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')

# Apply the fit_transform method on the instance of ColumnTransformer
X = ct.fit_transform(df)

# Convert the output into a NumPy array
X = np.array(X)

# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
y = le.fit_transform(df['Survived'])

# Print the updated matrix of features and the dependent variable vector
print(X)
print(y)


### Explanation

1. Importing the necessary libraries. Import pandas for data manipulation, numpy for numerical operations, and the necessary classes from scikit-learn for preprocessing.
```
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
```

2. Load the dataset. The Titanic dataset is loaded into a pandas DataFrame from a CSV file.

```
# Load the dataset
df = pd.read_csv('titanic.csv')
```

3. Identify the categorical data. Specify which features in our dataset are categorical. In this case, 'Sex', 'Embarked', and 'Pclass' are the categorical features.

```
# Identify the categorical data
categorical_features = ['Sex', 'Embarked', 'Pclass']
```

4. Implement an instance of the ColumnTransformer class. Initialize a ColumnTransformer that applies a OneHotEncoder to the categorical features. The remainder='passthrough' argument keeps the columns that are not transformed instead of discarding them.

```
# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), categorical_features)
    ], remainder='passthrough')

```
5. Apply the fit_transform method. Fit the ColumnTransformer to our DataFrame and transform the data. This applies one-hot encoding to our categorical features, converting them into numerical data suitable for a machine-learning model.

```
# Apply the fit_transform method on the instance of ColumnTransformer
X = ct.fit_transform(df)
```
6. Convert the output into a NumPy array. Depending on how many zeros the encoded result contains, ColumnTransformer returns either a SciPy sparse matrix or a dense array. np.array leaves a dense result as an ndarray; a sparse result needs its toarray() method to become a dense NumPy array.

```
# Convert the output into a NumPy array
X = np.array(X)
```
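
Whether step 6 receives a sparse or a dense result depends on the data, so it can help to handle both cases explicitly. The following is a minimal sketch, assuming a small hypothetical DataFrame (`toy`) in place of titanic.csv; the variable names `encoded` and `X_dense` are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for a few Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)

# Densify only when the result is actually a SciPy sparse matrix.
X_dense = encoded.toarray() if sparse.issparse(encoded) else np.asarray(encoded)
print(type(X_dense), X_dense.shape)
```

The check with `sparse.issparse` avoids calling `toarray()` on an object that is already a plain ndarray.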

7. Use LabelEncoder to encode binary categorical data. The 'Survived' feature is our dependent variable. Since it is a binary categorical feature, we use LabelEncoder to transform it into numerical data.

```
# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
y = le.fit_transform(df['Survived'])
```
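
To see what LabelEncoder learns, it may help to look at a column whose labels are strings. This is a small sketch with hypothetical labels (`"no"`/`"yes"`); if 'Survived' is already stored as 0 and 1, the encoder would simply return the same integers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels used only for illustration.
labels = ["no", "yes", "yes", "no", "yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # classes learned during fit, in sorted order
print(encoded)                        # integer code assigned to each label
print(le.inverse_transform(encoded))  # map the codes back to the original labels
```

The `classes_` attribute and `inverse_transform` are useful later for turning model predictions back into readable labels.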

8. Print the transformed feature matrix and dependent variable vector to verify that our preprocessing steps have been applied correctly.

```
# Print the updated matrix of features and the dependent variable vector
print("Updated matrix of features: \n", X)
print("Updated dependent variable vector: \n", y)
```

## Coding Exercise 4: Dataset Splitting and Feature Scaling


1. Import necessary Python libraries: pandas, train_test_split from sklearn.model_selection, and StandardScaler from sklearn.preprocessing.

2. Load the Iris dataset using the pandas read_csv function. The dataset's name is 'iris.csv'.

3. Use train_test_split to split the dataset into an 80-20 training-test set.

4. Set random_state=42 in the train_test_split function for reproducible results.

5. Print X_train, X_test, y_train, and y_test to inspect the dataset split.

6. Use StandardScaler to apply feature scaling on the training and test sets.

7. Print scaled training and test sets to verify feature scaling.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


# Load the Iris dataset
df = pd.read_csv('iris.csv')
# Separate features and target
X = df.drop(columns='target')
y = df['target']
# Split the dataset into an 80-20 training-test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state =42)
print('x train : \n', x_train)
print('\n x test : \n', x_test)
print('\n y train : \n', y_train)
print('\n y test : \n', y_test)
# Apply feature scaling on the training and test sets
sc = StandardScaler()
X_train = sc.fit_transform(x_train)
X_test = sc.transform(x_test)  # reuse the training mean and std; do not refit on the test set
# Print the scaled training and test sets
print('x train scaled : \n', X_train)
print('\n x test scaled : \n', X_test)
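
The scaler should learn its statistics from the training set only. The sketch below, which uses synthetic data with the same shape as Iris (an assumption made so it runs without iris.csv), contrasts transforming the test set with the training statistics against refitting the scaler on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Iris features: 150 rows, 4 columns.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(150, 4))
y = rng.integers(0, 3, size=150)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

sc = StandardScaler()
X_train = sc.fit_transform(x_train)  # learns mean and std from the training rows
X_test = sc.transform(x_test)        # applies those same training statistics

# Refitting on the test set uses different statistics and gives different values.
leaky = StandardScaler().fit_transform(x_test)
print(x_train.shape, x_test.shape)
print(np.allclose(X_test, leaky))
```

Refitting on the test set lets information about the test data shape the preprocessing, which is why `transform` is used there instead of `fit_transform`.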

## Coding Exercise 5: Feature Scaling for Machine Learning

1. Import the necessary libraries for data preprocessing, including the StandardScaler class and the train_test_split function.

2. Load the "Wine Quality Red" dataset into a pandas DataFrame. You can use the pd.read_csv function for this. Make sure you set the correct delimiter for the file.

3. Split your dataset into an 80-20 training-test set. Set random_state to 42 to ensure reproducible results.

4. Create an instance of the StandardScaler class.

5. Fit the StandardScaler on features from the training set, excluding the target variable 'quality'.

6. Use the "fit_transform" method of the StandardScaler object on the training dataset.

7. Apply the "transform" method of the StandardScaler object on the test dataset.

8. Print your scaled training and test datasets to verify the feature scaling process.



In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the Wine Quality Red dataset
df = pd.read_csv('winequality-red.csv', delimiter = ';')

# Separate features and target
X = df.drop(columns=['quality'])
y = df['quality']
# Split the dataset into an 80-20 training-test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 42)

# Create an instance of the StandardScaler class
sc = StandardScaler()

# Fit the StandardScaler on the features from the training set and transform it
X_train = sc.fit_transform(X_train)

# Apply the transform to the test set
X_test = sc.transform(X_test)

# Print the scaled training and test datasets
print(X_train)
print(X_test)

### Explanation
1. Import necessary libraries: We start by importing the necessary libraries for data preprocessing. This includes pandas for data manipulation, train_test_split from sklearn.model_selection to split our dataset into training and test sets, and StandardScaler from sklearn.preprocessing to apply feature scaling.


```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```

2. Load the dataset: The Wine Quality Red dataset is loaded into a pandas DataFrame using the pd.read_csv function. Here, we need to specify the correct delimiter, which in this case is a semicolon.

```
df = pd.read_csv('winequality-red.csv', delimiter=';')

```
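
The effect of the delimiter can be seen on a tiny in-memory sample. This is a minimal sketch, assuming two illustrative rows in the same semicolon-separated layout; the string `raw` is hypothetical and stands in for the real file.

```python
import io

import pandas as pd

# Two illustrative rows in a semicolon-separated layout.
raw = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# With the default comma delimiter, each line is read as a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With delimiter=';', the three columns are parsed separately.
right = pd.read_csv(io.StringIO(raw), delimiter=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

A quick look at `df.shape` or `df.columns` after loading is an easy way to catch a wrong delimiter.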

3. Split the dataset into a training set and a test set: We separate the target variable 'quality' from the features and then split the dataset into an 80-20 training-test set using the train_test_split function.

```
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

```
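
The two arguments do different jobs: test_size controls the proportions and random_state fixes the shuffle. The sketch below illustrates both on a small hypothetical array, so the sizes are easy to check by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state select the same rows.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)     # 8 training rows and 2 test rows
print(np.array_equal(a_test, b_test))  # identical test rows on both calls
```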

4. Create an instance of the StandardScaler class: The StandardScaler class is used to standardize features by removing the mean and scaling to unit variance.

```
sc = StandardScaler()

```
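
Standardization computes z = (x - mean) / std for each feature. As a small worked check, assuming a single hypothetical column of five values, the manual formula can be compared with StandardScaler's output. Note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One hypothetical feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mu = x.mean(axis=0)
sigma = x.std(axis=0)        # population standard deviation (ddof=0)
z_manual = (x - mu) / sigma  # z-score computed by hand

z_sklearn = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_sklearn))
```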
5. Fit the StandardScaler on the training set and transform the data: The StandardScaler is fitted to the training set and then used to transform both the training and test datasets.

```
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

```

6. Print the scaled datasets: Finally, we print the scaled training and test datasets to verify the feature scaling process.

```
print("Scaled training set:\n", X_train)
print("Scaled test set:\n", X_test)

```
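
StandardScaler returns plain NumPy arrays, so the printed output no longer carries column names. If labeled output is easier to verify, one option is to wrap the result back into a DataFrame. This is a sketch on a hypothetical three-row sample, not the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with two wine-style feature columns and a custom index.
train = pd.DataFrame(
    {"alcohol": [9.4, 9.8, 10.5], "pH": [3.51, 3.20, 3.26]},
    index=[10, 11, 12],
)

sc = StandardScaler()
scaled = sc.fit_transform(train)  # ndarray: column names and index are dropped

# Restore the labels so each scaled column can be identified.
scaled_df = pd.DataFrame(scaled, columns=train.columns, index=train.index)
print(scaled_df.round(3))
```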