# Multiclass Classification
### Learn how to use logistic regression with multiple categories.

##### Contents:
- Intro to the data    
    - pandas.Series.unique()
- Dummy variables
    - pd.get_dummies()
    - pd.concat()
    - df.drop()
- Multiclass classification
    - one-versus-all method
- Multiclass logistic regression model
    - training
        - string.startswith()
    - testing
        - model.predict_proba()

## 1: Introduction To The Data

The dataset we will be working with contains information on various cars. For each car we have information about the technical aspects of the vehicle, such as the motor's displacement, the weight of the car, the miles per gallon, and how fast the car accelerates. Using this information, we will predict the origin of the vehicle: North America, Europe, or Asia. Unlike our previous classification datasets, this one has three categories to choose from, which makes our task slightly more challenging.

Here's a preview of the data:

    18.0   8   307.0      130.0      3504.      12.0   70  1    "chevrolet chevelle malibu"
    15.0   8   350.0      165.0      3693.      11.5   70  1    "buick skylark 320"
    18.0   8   318.0      150.0      3436.      11.0   70  1    "plymouth satellite"
    
The dataset is hosted by the University of California, Irvine on [their machine learning repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG). As a side note, the UCI Machine Learning repository contains many small datasets that are useful when getting your hands dirty with machine learning.

You'll notice that the **Data Folder** contains a few different files. We'll be working with [auto-mpg.data](https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data), which omits the 8 rows containing missing values for fuel efficiency (`mpg` column). We've converted this data into a CSV file named `auto.csv` for you.

Here are the columns in the dataset:

- `mpg` -- Miles per gallon, Continuous.
- `cylinders` -- Number of cylinders in the motor, Integer, Ordinal, and Categorical.
- `displacement` -- Size of the motor, Continuous.
- `horsepower` -- Horsepower produced, Continuous.
- `weight` -- Weight of the car, Continuous.
- `acceleration` -- Acceleration, Continuous.
- `year` -- Year the car was built, Integer and Categorical.
- `origin` -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.
- `car_name` -- Name of the car.

#### Instructions:
- Import the Pandas library and read `auto.csv` into a Dataframe named `cars`.
- Use the `Series.unique()` method to assign the unique elements in the column `origin` to `unique_regions`. Then use the `print` function to display `unique_regions`.

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [19]:
cars = pd.read_csv('data/auto.csv')
unique_regions = cars.origin.unique()
unique_regions

array([1, 3, 2])
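
As a quick illustration of what `Series.unique()` does, the sketch below uses a small made-up `origin` column (the values are hypothetical, not taken from `auto.csv`). The distinct values come back in order of first appearance rather than sorted, which is consistent with the `[1, 3, 2]` ordering in the output above.

```python
import pandas as pd

# Hypothetical toy column standing in for cars["origin"]
toy_origin = pd.Series([1, 1, 3, 2, 3, 1], name="origin")

# unique() returns a NumPy array of distinct values in order of first appearance
toy_unique = toy_origin.unique()
print(toy_unique)

# value_counts() is a related call that also shows how many rows fall in each class
print(toy_origin.value_counts())
```

If a sorted list of classes is needed later, sorting the array explicitly is one option.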

## 2: Dummy Variables

In previous classification missions, categorical variables were already represented in the dataset as integer values (like `0` and `1`). In many cases, as with this dataset, you'll have to create a numeric representation of categorical values yourself. In this dataset, categorical variables exist in three columns: `cylinders`, `year`, and `origin`. The `cylinders` and `year` columns must be re-encoded as binary indicator columns before we can use them to predict the label `origin`. Even though the `year` column holds numbers, we're going to treat its values as categories. The year 71 is unlikely to relate to the year 70 in the way the two numbers relate numerically; they act more like two different labels. In cases like this, it is usually safer to treat discrete values as categorical variables.

We must use **dummy variables** for columns containing categorical values. Whenever a column has more than 2 categories, we need to create more columns to represent them. The `cylinders` column takes 5 distinct values (`3`, `4`, `5`, `6`, and `8`), so we can split it into 5 separate binary columns:

- `cyl_3` -- Does the car have 3 cylinders? 0 if False, 1 if True.
- `cyl_4` -- Does the car have 4 cylinders? 0 if False, 1 if True.
- `cyl_5` -- Does the car have 5 cylinders? 0 if False, 1 if True.
- `cyl_6` -- Does the car have 6 cylinders? 0 if False, 1 if True.
- `cyl_8` -- Does the car have 8 cylinders? 0 if False, 1 if True.

We can use the `pandas.get_dummies()` function to return a Dataframe containing binary columns built from the values in the `cylinders` column. In addition, if we set the `prefix` parameter to `cyl`, Pandas will prepend `cyl` to each column name to match the style we'd like:

    dummy_df = pd.get_dummies(cars["cylinders"], prefix="cyl")

We then use the `pandas.concat()` function to add the columns from this Dataframe back to cars:

    cars = pd.concat([cars, dummy_df], axis=1)
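
To see the whole encode, concatenate, and drop sequence in one place, here is a minimal sketch on a made-up three-row frame (`toy_cars` and its values are hypothetical). It assumes a pandas version where `get_dummies()` accepts the `dtype` parameter; passing `dtype=int` gives `0`/`1` columns, since recent pandas versions default to `True`/`False`.

```python
import pandas as pd

# Hypothetical toy frame; the real cars Dataframe has many more rows and columns
toy_cars = pd.DataFrame({
    "mpg": [18.0, 24.0, 22.0],
    "cylinders": [8, 4, 6],
})

# dtype=int forces 0/1 columns (recent pandas versions default to True/False)
toy_dummies = pd.get_dummies(toy_cars["cylinders"], prefix="cyl", dtype=int)

# axis=1 appends the dummy columns side by side, aligned on the row index
toy_cars = pd.concat([toy_cars, toy_dummies], axis=1)

# Drop the original column now that it is encoded
toy_cars = toy_cars.drop("cylinders", axis=1)
print(toy_cars)
```

Only categories that actually occur in the column get a dummy column, so this toy frame ends up with `cyl_4`, `cyl_6`, and `cyl_8` but no `cyl_3` or `cyl_5`.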

Now it's your turn! Repeat the same process for the `year` column.

#### Instructions:
- Use the `pandas.get_dummies()` function to create dummy values from the `year` column.
    - Use the `prefix` parameter to prepend `year` to each of the resulting column names.
    - Assign the resulting Dataframe to `dummy_years`.
- Use the `pandas.concat()` function to concatenate the columns from `dummy_years` to `cars`.
- Use the `DataFrame.drop()` method to drop the `year` and `cylinders` columns from `cars`.
- Display the first 5 rows of the new `cars` Dataframe to confirm.

In [20]:
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)

print(cars.year.unique())
dummy_years = pd.get_dummies(cars.year, prefix="year")

cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop(["year", "cylinders"], axis=1)
cars.head()

[70 71 72 73 74 75 76 77 78 79 80 81 82]


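|   | mpg | displacement | horsepower | weight | acceleration | origin | cyl_3 | cyl_4 | cyl_5 | cyl_6 | ... | year_73 | year_74 | year_75 | year_76 | year_77 | year_78 | year_79 | year_80 | year_81 | year_82 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 15.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 18.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 17.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |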


## 3: Multiclass Classification

In previous missions, we explored binary classification, where there were only 2 possible categories, or classes. When we have 3 or more categories, we call the problem a **multiclass classification** problem. There are a few different methods of doing multiclass classification; in this mission, we'll focus on the one-versus-all method.

The one-versus-all method is a technique where we choose a single category as the Positive case and group the rest of the categories as the Negative case. We're essentially splitting the problem into multiple binary classification problems, one per category. For each observation, each binary model then outputs the probability that the observation belongs to its category.
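
The relabeling step can be sketched on a handful of made-up `origin` labels (the `toy_origin` values and the `binary_targets` name are hypothetical, chosen only for illustration). Each category gets its own binary target in which rows of that category are Positive and all other rows are Negative.

```python
import pandas as pd

# Hypothetical origin labels (1: North America, 2: Europe, 3: Asia)
toy_origin = pd.Series([1, 3, 2, 1, 3])

# One binary target per class: True where the row belongs to that class
binary_targets = {label: (toy_origin == label) for label in sorted(toy_origin.unique())}

for label, target in binary_targets.items():
    print(label, target.astype(int).tolist())
```

Every row is Positive in exactly one of the binary targets, which is what lets us recombine the separate models into a single multiclass prediction later.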

To start, let's split our data into a training set and a test set. We've already randomized the `cars` Dataframe for you and assigned the shuffled Dataframe to `shuffled_cars`.

#### Instructions:
- Split the `shuffled_cars` Dataframe into 2 Dataframes: `train` and `test`.
    - Assign the first 70% of `shuffled_cars` to `train`.
    - Assign the last 30% of `shuffled_cars` to `test`.

In [21]:
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]

highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]
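
The same shuffle-then-slice idea is shown below on a made-up 10-row frame so the sizes are easy to check by hand. The frame `toy`, the seed value, and the use of `np.random.default_rng` are assumptions for this sketch; the cell above uses the unseeded `np.random.permutation`, so its split changes on every run.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame standing in for cars
toy = pd.DataFrame({"mpg": np.arange(10, 20), "origin": [1, 2, 3, 1, 1, 2, 3, 1, 2, 1]})

# A seeded generator makes the shuffle repeatable between runs (the seed value is arbitrary)
rng = np.random.default_rng(1)
shuffled_positions = rng.permutation(len(toy))
shuffled_toy = toy.iloc[shuffled_positions]

# First 70% of the shuffled rows for training, the rest for testing
cutoff = int(len(shuffled_toy) * 0.70)
toy_train = shuffled_toy.iloc[:cutoff]
toy_test = shuffled_toy.iloc[cutoff:]
print(len(toy_train), len(toy_test))
```

Because the rows are sliced by position after shuffling, the two sets never share a row and together cover the whole frame.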

## 4: Training A Multiclass Logistic Regression Model

In the one-vs-all approach, we're essentially converting an n-class (in our case `n` is 3) classification problem into `n` binary classification problems. For our case, we'll need to train 3 models:

- A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
- A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
- A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).

Each of these models is a binary classification model that returns a probability between 0 and 1. When we apply these models to new data, each model returns one probability value per observation (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.

We'll use the dummy variables we created from the `cylinders` and `year` columns to train 3 models using the LogisticRegression class from scikit-learn.

#### Instructions: 

For each value in unique_origins, train a logistic regression model with the following parameters:

- `X`: Dataframe containing just the cylinder & year binary columns.
- `y`: list (or Series) of Boolean values:
    - `True` if observation's value for origin matches the current iterator variable.
    - `False` if observation's value for origin doesn't match the current iterator variable.

Add each model to the models dictionary with the following structure:

- key: origin value (1, 2, or 3),
- value: relevant LogisticRegression model instance.
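
Selecting the cylinder and year binary columns by name prefix can be sketched as follows. The column names in `toy_columns` are hypothetical stand-ins for the columns of `train` after the dummy step. `str.startswith()` also accepts a tuple of prefixes, so a single call can cover both groups.

```python
# Hypothetical column names resembling those in train after the dummy step
toy_columns = ["mpg", "weight", "origin", "cyl_4", "cyl_8", "year_70", "year_82"]

# str.startswith accepts a tuple of prefixes, so one call covers both groups
toy_features = [c for c in toy_columns if c.startswith(("cyl", "year"))]
print(toy_features)
```

Note that the prefix `"cyl"` would also match the original `cylinders` column if it had not been dropped; matching on `"cyl_"` and `"year_"` would be a stricter choice.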


In [25]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()

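models = {}
# Use only the dummy columns created from cylinders and year as features
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]
X_train = train[features]

for origin in unique_origins:
    # Binary target: True for cars from the current origin, False for all others
    y_train = train.origin == origin
    model = LogisticRegression()
    model.fit(X_train, y_train)
    models[origin] = model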

## 5: Testing The Models

Now that we have a model for each category, we can run our test dataset through the models and evaluate how well they performed.

#### Instructions:
- For each origin value from `unique_origins`:
    - Use the LogisticRegression `predict_proba` method to get the predicted probabilities for the test set, and add them to the `testing_probs` Dataframe as a column (3 columns in total).
- Here's what the final Dataframe should look like (without all the zeroes, of course!):

|   | 1     | 2     | 3     |
|---|-------|-------|-------|
| 0 | 0.000 | 0.000 | 0.000 |
| 1 | 0.000 | 0.000 | 0.000 |
| 2 | 0.000 | 0.000 | 0.000 |
| 3 | 0.000 | 0.000 | 0.000 |
| 4 | 0.000 | 0.000 | 0.000 |
| 5 | 0.000 | 0.000 | 0.000 |

In [41]:
testing_probs = pd.DataFrame(columns=unique_origins)

for origin in unique_origins:
    testing_probs[origin] = models[origin].predict_proba(test[features])[:,1]
    
testing_probs.head()

|   | 1        | 2        | 3        |
|---|----------|----------|----------|
| 0 | 0.968470 | 0.030118 | 0.022281 |
| 1 | 0.955135 | 0.016504 | 0.073064 |
| 2 | 0.960664 | 0.022932 | 0.038702 |
| 3 | 0.868371 | 0.083001 | 0.059900 |
| 4 | 0.325527 | 0.318606 | 0.345816 |
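
The `[:,1]` slice in the cell above picks one column out of the two that `predict_proba` returns. The sketch below illustrates why on synthetic data (the arrays `X` and `y`, the seed, and the name `toy_model` are all hypothetical). The columns follow the order of the model's `classes_` attribute, which for a Boolean target is `[False, True]`, so column `1` holds the probability of the Positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: one feature, positives tend to have larger values
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)]).reshape(-1, 1)
y = np.array([False] * 50 + [True] * 50)

toy_model = LogisticRegression().fit(X, y)

# Columns of predict_proba follow toy_model.classes_, which is sorted: [False, True]
print(toy_model.classes_)

probs = toy_model.predict_proba(X)
positive_probs = probs[:, 1]  # probability of the True (positive) class
print(probs.shape, positive_probs[:3])
```

Within a single model the two columns sum to 1 for every row. The three one-versus-all models are fit independently, though, so their Positive-class probabilities for the same row need not sum to 1; the first row of the output above, for example, adds up to roughly 1.02.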


## 6: Choose The Origin

Now that we've trained the models and computed the probabilities for each origin, we can classify each observation. To do so, we select the origin with the highest predicted probability for that observation.

Each column in our `testing_probs` Dataframe represents an origin, so we just need to choose the column with the largest probability in each row. We can use the [Dataframe method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html) `.idxmax()` to return a Series where each value is the label of the column in which the maximum value occurs for that observation. We need to set the `axis` parameter to `1`, since we want to find the maximum across columns. Since each column maps directly to an origin, the resulting Series holds the classification from our model.
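
As a small check of how `.idxmax(axis=1)` behaves, the sketch below rebuilds the five `testing_probs` rows shown in the previous section by hand (the name `toy_probs` is hypothetical) and picks the winning column for each row.

```python
import pandas as pd

# First five rows of the testing_probs output shown in the previous section
toy_probs = pd.DataFrame({
    1: [0.968470, 0.955135, 0.960664, 0.868371, 0.325527],
    2: [0.030118, 0.016504, 0.022932, 0.083001, 0.318606],
    3: [0.022281, 0.073064, 0.038702, 0.059900, 0.345816],
})

# axis=1 scans across columns and returns the label of the largest value in each row
toy_predicted = toy_probs.idxmax(axis=1)
print(toy_predicted.tolist())
```

The first four rows are assigned origin `1` by a wide margin, while the fifth is assigned origin `3` even though all three probabilities are close, a reminder that the chosen label by itself says nothing about how confident the models were.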

#### Instructions:

- Classify each observation in the test set using the `testing_probs` Dataframe.
- Assign the predicted origins to `predicted_origins` and use the `print` function to display it.

In [45]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      1
1      1
2      1
3      1
4      3
5      1
6      3
7      1
8      2
9      2
10     1
11     2
12     2
13     1
14     2
15     3
16     1
17     1
18     1
19     1
20     2
21     3
22     1
23     2
24     1
25     1
26     2
27     1
28     2
29     2
      ..
88     3
89     1
90     1
91     3
92     1
93     2
94     1
95     1
96     1
97     1
98     1
99     1
100    3
101    1
102    1
103    1
104    1
105    1
106    1
107    1
108    1
109    1
110    1
111    1
112    2
113    1
114    1
115    1
116    1
117    1
Length: 118, dtype: int64


## 7: Conclusion

In this mission, we learned the basics of extending logistic regression to work for multiclass classification problems. In the next mission, we'll dive into more intermediate linear regression concepts.