## **Exploring Missing Data**

In [2]:
import pandas as pd

volunteer = pd.read_csv('volunteer_opportunities.csv')

# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(["Latitude", "Longitude"], axis=1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=["category_desc"])

# Print out the shape of the subset
print(volunteer_subset.shape)

(617, 33)


## **Working With Data Types**

In [None]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer['hits'].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes)

## **Training and Test Sets**

### **Why use stratify?**

The `stratify` parameter in the `train_test_split` function of scikit-learn is used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.


In many real-world problems, especially in classification tasks, the distribution of classes might be imbalanced or skewed. For instance, in a dataset for disease diagnosis, the majority of samples might be of the 'no disease' class with a small fraction of 'disease' class. Without stratification, there's a risk that the train or test set might end up having an unrepresentative distribution of classes, which can bias the model training and evaluation.
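The effect of `stratify` can be checked on a small synthetic example. The sketch below uses made-up labels with a 90/10 class imbalance and a placeholder feature column (both are illustrative assumptions, not part of the volunteer dataset) and compares the share of the minority class in each part of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90 samples of class 0 and 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Stratified split: train and test both keep the 90/10 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("full set minority share:", y_demo.mean())
print("train minority share:   ", y_tr.mean())
print("test minority share:    ", y_te.mean())
```

With stratification, the 10 minority samples are divided in proportion to the split sizes, so neither part ends up without them.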

In [None]:
from sklearn.model_selection import train_test_split

# Create a DataFrame with all columns except category_desc
X = volunteer_subset.drop(['category_desc'], axis=1)

# Create a category_desc labels dataset
y = volunteer_subset[['category_desc']]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


## **Standardization**

Standardization, log normalization, and feature scaling are preprocessing techniques that make data more suitable for machine learning models. Standardization is described here; log normalization and feature scaling each have their own section below.

1. **Standardization**:
   - **Description**: Standardization involves rescaling the features so that they have a mean of 0 and a standard deviation of 1. It's achieved by subtracting the mean and then dividing by the standard deviation.
   - **When to Use**: It's particularly useful when the features in a dataset have different units or very different scales. Standardization is also important when using algorithms that are sensitive to the scale of the data, like Support Vector Machines (SVMs), k-Nearest Neighbors (k-NN), and Principal Component Analysis (PCA).
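The description above can be written out directly. This is a minimal sketch on a hypothetical two-column array (the values are invented for illustration); it subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np

# Two hypothetical features on very different scales
data = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
    [4.0, 7000.0],
])

# z = (x - mean) / std, computed separately for each column
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)
standardized = (data - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

scikit-learn's `StandardScaler` applies the same per-column formula, also using the population standard deviation.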


In summary, standardization, log normalization, and feature scaling are essential techniques in the preprocessing phase of a machine learning pipeline. They help normalize data, reduce skewness, and ensure that the algorithm treats all features equally. The choice of technique largely depends on the nature of your data and the specific requirements of the machine learning algorithm you are using.

### **Variance and Bias**

Bias and variance describe two different kinds of error a machine learning model can make. In simpler terms:

1. **Bias**:
   - **High Bias**: This means the model is overly simplistic. It doesn't learn enough from the training data, sort of like not paying enough attention in class. Because of this, it makes a lot of mistakes both on the training data and on new, unseen data. This is known as underfitting.
   - **Low Bias**: In this case, the model pays more attention to the training data and learns it quite well. It makes fewer mistakes on the training data, meaning it fits the training data better.

2. **Variance**:
   - **High Variance**: This happens when the model is like a student who only focuses on memorizing everything for the test and doesn’t really understand the concepts. It performs really well on the training data (like acing a test it saw before) but poorly on new, unseen data because it's too focused on the specific examples it was trained on. This is overfitting.
   - **Low Variance**: A model with low variance doesn’t change its prediction method much with different training data. It's like a student with a solid understanding of concepts, who can answer different types of questions on the same topic.

In an ideal world, you want a model with low bias and low variance, which means it learns well from the training data and generalizes well to new data. However, in reality, there's usually a trade-off between bias and variance (known as the bias-variance tradeoff). Balancing them is key to building a good predictive model.
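One way to see the trade-off is to fit polynomials of increasing degree to noisy data. The sketch below is an illustration under assumptions: the sine-shaped data, the noise level, and the chosen degrees are all invented for the example. A low degree tends to underfit (high bias), while a high degree fits the training points more closely and typically does worse on new data (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_fit, y_fit = make_data(20)    # small training set
x_new, y_new = make_data(200)   # unseen data from the same process

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_fit, y_fit, degree)
    errors[degree] = (mse(coeffs, x_fit, y_fit), mse(coeffs, x_new, y_new))
    print(f"degree {degree}: train MSE={errors[degree][0]:.3f}, "
          f"new-data MSE={errors[degree][1]:.3f}")
```

The training error can only shrink as the degree grows, so a low training error alone says nothing about generalization; the error on new data is the number to compare.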

## **Log Normalization**

   - **Description**: This technique involves applying the natural logarithm to your data to help manage skewed distributions. It can make highly skewed distributions less skewed and can help with making patterns more interpretable and accessible.
   - **When to Use**: Log normalization is useful when you have data with a wide range of values, or when a few extreme values are dominating the learning process and you want to reduce their impact.
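As a small self-contained illustration, the sketch below applies the natural log to a hypothetical right-skewed array (the values are invented) and compares the variance before and after.

```python
import numpy as np

# Hypothetical right-skewed values: most are small, a few are very large
values = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 1000.0, 20000.0])

# Natural logarithm; the inputs must be strictly positive
log_values = np.log(values)

print("variance before:", values.var())
print("variance after: ", log_values.var())
```

If a column can contain zeros, `np.log` returns `-inf` for them; `np.log1p`, which computes log(1 + x), is a common alternative in that case.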

In [None]:
import numpy as np

wine = pd.read_csv('wine_types.csv')

# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


## **Feature Scaling**

   - **Description**: Feature scaling is a broad term for methods that bring features onto a comparable scale. One common variant, Min-Max scaling, rescales each feature to a fixed range such as 0 to 1. Standardization, which the code below applies with `StandardScaler`, is another.
   - **When to Use**: Like standardization, feature scaling is critical when using algorithms that are sensitive to the magnitude of values and where distance between data points is important (e.g., k-NN, SVM, neural networks). It's also useful for speeding up gradient descent in learning algorithms.
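Min-Max scaling itself can be sketched as follows. The feature values are hypothetical; the manual formula is compared with scikit-learn's `MinMaxScaler`, whose default `feature_range` is (0, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features with different ranges
features = np.array([
    [10.0, 200.0],
    [20.0, 400.0],
    [40.0, 1000.0],
])

# Manual formula per column: (x - min) / (max - min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
manual = (features - col_min) / (col_max - col_min)

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(features)

print(manual)
print(scaled)
```

Each column now runs from 0 to 1, but the relative spacing of the values within a column is unchanged.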

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Subset the DataFrame you want to scale
wine_subset = wine[["Ash", "Alcalinity of ash", "Magnesium"]]

# Apply the scaler to wine_subset
wine_subset_scaled = scaler.fit_transform(wine_subset)

## **Standardized data and modeling**

**KNN on unscaled data**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

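# Assumption: X and y hold numeric features and labels here (for example the
# wine data); KNN cannot be fit on non-numeric columns
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()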

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

**KNN on scaled data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler
scaler = StandardScaler()

# Scale the training and test features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))

## **Feature Engineering**

### **When to use LabelEncoder or OneHotEncoding?**

LabelEncoder and One Hot Encoding are techniques used for encoding categorical variables in machine learning. Deciding which one to use depends on the specific requirements of your dataset and the model you are using. Here are some guidelines:

1. **LabelEncoder**:
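   - **When to Use**: Label encoding is typically used when the categorical feature is ordinal, meaning there is a clear ordering in the categories. For example, sizes like Small, Medium, Large can be encoded as 0, 1, 2. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, so it would produce Large=0, Medium=1, Small=2; to enforce a specific order, use an explicit mapping or `OrdinalEncoder` with its `categories` argument.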
   - **How it Works**: It simply converts each category into an integer. This is useful for models that can interpret the ordinal nature of the variable.
   - **Limitations**: It can introduce a new problem in the model, as it implies an order among the categories which might not exist (e.g., encoding cities or country names).

2. **One Hot Encoding**:
   - **When to Use**: One Hot Encoding is used when the categorical feature is nominal (i.e., there is no intrinsic ordering of the categories). For instance, color categories such as Red, Blue, Green.
   - **How it Works**: It creates a new binary column for each category in the original variable. Each observation gets a 1 in the column of its corresponding category and 0 in all other new columns.
   - **Limitations**: This can dramatically increase the dataset’s dimensionality, especially if the categorical variable has many unique categories. The data becomes sparser, which can expose the model to the curse of dimensionality.

3. **Considerations for Model Type**:
   - Some models like tree-based algorithms (e.g., Decision Trees, Random Forests) can work well with ordinal encodings, as they can handle the intrinsic ordering.
   - Other models, particularly linear models (like Logistic Regression) and neural networks, usually work better with One Hot Encoding, because they would otherwise treat the integer codes as numeric magnitudes.

4. **Dataset Size and Computational Constraints**:
   - One Hot Encoding can be computationally expensive for variables with a large number of categories. In such cases, techniques like embedding or dimensionality reduction might be more efficient.

5. **Hybrid Approaches**:
   - Sometimes, a hybrid approach might be useful, like using Label Encoding for features with high cardinality and One Hot Encoding for features with fewer categories.

In summary, the choice between LabelEncoder and One Hot Encoding depends on the nature of the categorical variable (ordinal vs. nominal), the type of model you are using, and the size and computational constraints of your dataset.
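The difference between the encodings can be seen on a tiny example. The sketch below uses a hypothetical `size` column (invented for illustration) and compares `LabelEncoder`, `OrdinalEncoder` with an explicit category order, and `pd.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature
sizes = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small"]})

# LabelEncoder sorts the classes, so the codes follow alphabetical order:
# Large=0, Medium=1, Small=2
label_codes = LabelEncoder().fit_transform(sizes["size"])
print(label_codes)

# OrdinalEncoder accepts an explicit order per column:
# Small=0, Medium=1, Large=2
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal.fit_transform(sizes[["size"]]).ravel()
print(ordinal_codes)

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(sizes["size"], dtype=int)
print(one_hot)
```

Passing `dtype=int` to `pd.get_dummies` yields 0/1 columns; without it, recent pandas versions return boolean columns.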

**Encoding categorical variables - binary**

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

hiking = pd.read_json('hiking.json')

# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


**Encoding categorical variables - one hot encoding**

In [3]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  


## **Putting it all together**

**Checking column types**

In [None]:
import pandas as pd

ufo = pd.read_csv('ufo_sightings_large.csv')

# Print the DataFrame info
print(ufo.info())

# Change the type of seconds to float
ufo["seconds"] = ufo['seconds'].astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo.info())

**Dropping Missing Data**

In [5]:
# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[['length_of_time', 'state', 'type']].isna().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset=['length_of_time', 'state', 'type'])

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


**Extracting numbers from strings**

In [10]:
ufo['length_of_time']

0               2 weeks
1                30sec.
2                   NaN
3       about 5 minutes
4                     2
             ...       
4930    about 5 seconds
4931         25 seconds
4932      early morning
4933            2 hours
4934          1 minutes
Name: length_of_time, Length: 4935, dtype: object
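The column mixes numbers, free text, and missing values. Missing entries are stored as float `NaN`, and `re.search` raises a `TypeError` when given a non-string, so an extraction function has to skip them. The following is a minimal sketch with a hypothetical helper, `first_number`, run on a few sample values taken from the output above.

```python
import math
import re

def first_number(value):
    """Return the first run of digits in value as an int, or None."""
    # Missing values arrive as float NaN, which re.search cannot process
    if not isinstance(value, str):
        return None
    match = re.search(r"\d+", value)
    return int(match.group(0)) if match else None

samples = ["2 weeks", "30sec.", math.nan, "about 5 minutes", "early morning"]
extracted = [first_number(s) for s in samples]
print(extracted)  # [2, 30, None, 5, None]
```

The helper only pulls out the first number and ignores the unit, so "2 weeks" and "2" both yield 2; converting everything to minutes would need additional parsing.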

In [14]:
import re

def return_minutes(time_string):
    # Missing values are float NaN, which re.search cannot process
    if not isinstance(time_string, str):
        return None

    # Search for the first run of digits in time_string
    num = re.search(r'\d+', time_string)
    if num is not None:
        return int(num.group(0))
    return None

# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'minutes']].head())

**Apply log normalization**

In [None]:
import numpy as np

# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo['seconds_log'].var())

**Encoding categorical variables**

In [None]:
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x == 'us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

**Features from date**

In [None]:
# Look at the first 5 rows of the date column
print(ufo['date'].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())

**Text Vectorization**

In [None]:
# Take a look at the head of the desc field
print(ufo['desc'].head())

# Import and instantiate the tfidf vectorizer object
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()

# Fit and transform desc using vec
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns and rows
print(desc_tfidf.shape)