# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Run all the cells below.

In [3]:
websites = pd.read_csv('data/website.csv')

In [4]:
# Drop the five columns in a single call instead of one call per column.
websites.drop(columns=['SOURCE_APP_PACKETS', 'REMOTE_APP_PACKETS', 'REMOTE_APP_BYTES',
                       'URL_LENGTH', 'TCP_CONVERSATION_EXCHANGE'], inplace=True)

In [5]:
websites.drop('CONTENT_LENGTH', axis=1, inplace=True)

In [6]:
websites.dropna(inplace=True)

In [7]:
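def countryfix(name):
    # Collapse the different spellings of the same country into one code.
    if 'GB' in name or name == 'United Kingdom':
        return 'UK'
    elif name == 'Cyprus':
        return 'CY'
    else:
        # Upper-case everything else so that e.g. 'us' and 'US' match.
        return name.upper()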

In [8]:
websites['WHOIS_COUNTRY'] = websites['WHOIS_COUNTRY'].apply(countryfix)

In [9]:
topcountries = websites['WHOIS_COUNTRY'].value_counts()[:10].keys()

In [10]:
websites['WHOIS_COUNTRY'] = websites['WHOIS_COUNTRY'].apply(lambda country: country if country in topcountries
                                                            else 'OTHER')

In [11]:
websites.drop(['WHOIS_STATEPRO', 'WHOIS_REGDATE', 'WHOIS_UPDATED_DATE'], axis=1, inplace=True)

In [12]:
websites.drop('URL', axis=1, inplace=True)

In [16]:
def serverfix(name):
    if 'microsoft' in name.lower():
        return 'Microsoft'
    elif 'apache' in name.lower():
        return 'Apache'
    elif 'nginx' in name.lower():
        return 'nginx'
    else:
        return 'Other'

In [17]:
websites['SERVER'] = websites['SERVER'].apply(serverfix)

In [19]:
ctr_dum = pd.get_dummies(websites['WHOIS_COUNTRY']).drop(columns='NONE')

In [20]:
chs_dum = pd.get_dummies(websites['CHARSET']).drop(columns='None')

In [21]:
svr_dum = pd.get_dummies(websites['SERVER']).drop(columns='Other')

In [22]:
website_dummy = websites.join(ctr_dum).join(chs_dum).join(svr_dum).drop(columns=['WHOIS_COUNTRY',
                                                                                 'CHARSET', 'SERVER'])

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
x_train, x_test, y_train, y_test = train_test_split(website_dummy.drop(columns='Type'),
                                                    website_dummy['Type'], train_size=0.8)

In [25]:
from sklearn.linear_model import LogisticRegression

In [26]:
webmodel = LogisticRegression().fit(x_train, y_train)



In [27]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [28]:
y_pred = webmodel.predict(x_test)

In [29]:
confusion_matrix(y_test, y_pred)

array([[302,   7],
       [ 40,   7]])

In [30]:
accuracy_score(y_test, y_pred)

0.8679775280898876

# KNN

#### Our algorithm is K-Nearest Neighbors. 

Start by loading `KNeighborsClassifier` from scikit-learn and then initializing and fitting the model. We'll start off with a model where k=3.

In [40]:
from sklearn.neighbors import KNeighborsClassifier

webmodelknn = KNeighborsClassifier(3)
webmodelknn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

To test your model, compute the predicted values for the testing sample and print the confusion matrix as well as the accuracy score.

In [41]:
y_pred = webmodelknn.predict(x_test)

In [42]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

array([[292,  17],
       [ 13,  34]])

In [43]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.9157303370786517
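As a quick sanity check, the accuracy score can be recovered from the confusion matrix, since the correct predictions sit on its diagonal. The sketch below uses the k=3 matrix printed above; the names `cm`, `accuracy`, and `recall_positive` are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix for k=3, copied from the output above.
# scikit-learn convention: rows are true labels, columns are predicted labels.
cm = np.array([[292, 17],
               [13, 34]])

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy = np.trace(cm) / cm.sum()

# Recall for the positive class = true positives / all actual positives.
recall_positive = cm[1, 1] / cm[1].sum()

print(accuracy)
print(round(recall_positive, 3))
```

The first value should agree with the `accuracy_score` output above. The recall is worth looking at as well, because the classes are imbalanced (309 negatives against 47 positives in this test set), so accuracy alone can hide how many positives are missed.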

#### We'll create another K-Nearest Neighbors model with k=5. 

Initialize and fit the model below and print the confusion matrix and the accuracy score.

In [44]:
webmodelknn5 = KNeighborsClassifier(5)
webmodelknn5.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [45]:
y_pred = webmodelknn5.predict(x_test)
confusion_matrix(y_test, y_pred)

array([[299,  10],
       [ 14,  33]])

In [46]:
accuracy_score(y_test, y_pred)

0.9325842696629213

Did you see an improvement in the confusion matrix when increasing k to 5? Did you see an improvement in the accuracy score? Write your conclusions below.

In [8]:
# There is no significant improvement.
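One way to put the comparison on firmer ground is to try several values of k on the same split. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `website_dummy`, and the names `x_tr`, `x_te`, `y_tr`, `y_te`, and `scores` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; the notebook uses website_dummy).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Fix random_state so every k is evaluated on exactly the same split.
x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Fit one model per k and record its accuracy on the test set.
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr)
    scores[k] = accuracy_score(y_te, model.predict(x_te))

for k, acc in scores.items():
    print(f"k={k}: accuracy={acc:.3f}")
```

Because the notebook's `train_test_split` call does not set `random_state`, small differences between k=3 and k=5 on a single split may be within run-to-run noise, which is consistent with the conclusion above.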

# Bonus Challenge - Feature Scaling

Problem-solving in machine learning is iterative. You can improve your model's predictions with various techniques, although there is a sweet spot between the time you spend and the improvement you get. So far you have completed only one iteration of the analysis, and there are more you could run to improve the results. Doing so requires deeper knowledge of statistics and mastery of more data analysis techniques. We don't have time to reach that advanced goal in this bootcamp, but with steady effort afterwards you will eventually get there.

However, there is one more advanced technique we do want you to learn now: *feature scaling*. The idea of feature scaling is to standardize or normalize the range of the independent variables (the features) of the data. This can make the outliers more apparent so that you can remove them. This step needs to happen during Challenge 6, after you split the training and test data: reuse the same split, because splitting the data again would make it impossible to compare your results with and without feature scaling. For general concepts about feature scaling, click [here](https://en.wikipedia.org/wiki/Feature_scaling). To read deeper, click [here](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e).

In the next cell, attempt to improve your model's prediction accuracy by means of feature scaling. A library you can use is `sklearn.preprocessing.RobustScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)). Use the `RobustScaler` to fit and transform your `X_train` (named `x_train` in this notebook), then transform `X_test` (`x_test`) with the same fitted scaler. Then fit a logistic regression on the transformed training data, predict on the transformed test data, and obtain the accuracy score in the same way as before. Compare the accuracy score on the scaled data with the previous accuracy score. Is there an improvement?

In [None]:
# Your code here
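A minimal sketch of the workflow described above, under stated assumptions: synthetic data from `make_classification` stands in for `website_dummy`, one feature is deliberately inflated to mimic columns measured on very different scales, and the names `x_tr`, `x_te`, `acc_raw`, and `acc_scaled` are hypothetical. In the notebook itself you would reuse the existing `x_train`, `x_test`, `y_train`, and `y_test` rather than splitting again.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (an assumption; the notebook uses website_dummy).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 10_000  # exaggerate one feature's scale for illustration

x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into the scaling.
scaler = RobustScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

# Same model on raw and scaled features, evaluated on the same split.
baseline = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
scaled = LogisticRegression(max_iter=1000).fit(x_tr_scaled, y_tr)

acc_raw = accuracy_score(y_te, baseline.predict(x_te))
acc_scaled = accuracy_score(y_te, scaled.predict(x_te_scaled))

print(f"without scaling: {acc_raw:.3f}")
print(f"with RobustScaler: {acc_scaled:.3f}")
```

`RobustScaler` centers each feature on its median and divides by the interquartile range, so it is less sensitive to outliers than scaling by mean and standard deviation. Whether accuracy improves depends on the data, so compare the two scores on your own split rather than assuming a gain.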