# Hands-On Tutorial 2: Linear Models
Today we will learn how to implement Logistic and Linear Regression.

## 2. Logistic Regression
Logistic regression is a supervised classification algorithm: it learns from labeled observations to estimate the class of a new observation. It does so by modeling the probability that an observation belongs to a particular category.
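To make the probability model concrete, here is a minimal sketch (not part of the original notebook). It assumes the standard binary formulation, where a linear score is passed through the sigmoid function. The weights `w`, bias `b`, and observation `x` are made-up values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and observation, chosen only for illustration.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

score = w @ x + b              # linear score: 2*1 - 1*3 + 0.5 = -0.5
p_class1 = sigmoid(score)      # estimated P(y = 1 | x)
label = int(p_class1 >= 0.5)   # threshold at 0.5 to obtain a class label

print(p_class1, label)
```

A negative score gives a probability below 0.5, so this observation would be assigned to class 0.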


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

In [None]:
X, y = make_classification(
        n_samples=10,
        n_features=2,
        n_redundant=0,
        n_informative=2,
        random_state=1,
        n_clusters_per_class=1)
print(X)
print(y)

In [None]:
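# A very large C makes the L2 penalty negligible, so the fit is effectively unregularized.
model = LogisticRegression(C=1e20, solver='lbfgs')
model.fit(X, y)
preds = model.predict(X)

# Training accuracy: fraction of samples whose predicted label matches the true label.
score_sklearn = (preds == y).mean()
print('Score Sklearn: {}'.format(score_sklearn))
# Learned bias term and feature weights.
print(model.intercept_, model.coef_)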
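The printed `intercept_` and `coef_` are all that is needed to reproduce the model's probabilities by hand. The sketch below is an illustration under assumptions: it uses a larger synthetic sample and the default regularization rather than the exact settings above, and the names `X_demo`, `y_demo`, and `clf` are introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data similar to the cell above, but with more samples (an assumption).
X_demo, y_demo = make_classification(
    n_samples=100, n_features=2, n_redundant=0, n_informative=2,
    random_state=1, n_clusters_per_class=1)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score for each sample, then the sigmoid gives P(y = 1 | x).
scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Compare with scikit-learn's own probability for class 1.
sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```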

## 3. TF-IDF and Naive Bayes
Next, we will classify text by combining TF-IDF vectorization with a Multinomial Naive Bayes classifier.


### Loading the Dataset
We use the 20 Newsgroups dataset, a collection of newsgroup posts spanning 20 categories. For this tutorial, we have chosen 4 of them: `alt.atheism`, `soc.religion.christian`, `sci.med`, and `comp.graphics`.


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

In [None]:
news = fetch_20newsgroups()
news.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Preprocessing the Data
Here, we load the training and test splits, restricted to the chosen categories, and strip headers, footers, and quoted replies from each post.

In [None]:
target_categories = ['alt.atheism','comp.graphics','sci.med','soc.religion.christian']

train = fetch_20newsgroups(subset='train', categories=target_categories, remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=target_categories, remove=('headers', 'footers', 'quotes'))

### Visualizing the Data

In [None]:
print(f'CATEGORY: {train.target_names[train.target[0]]}')
print('-' * 80)
print(train.data[0])
print('-' * 80)

CATEGORY: comp.graphics
--------------------------------------------------------------------------------
Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
--------------------------------------------------------------------------------


### TF-IDF Vectorization

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', token_pattern=r'\b[a-zA-Z]{2,}\b', min_df=0.001)

In [None]:
sample_sentences = [
    'My name is George, this is my name',
    'I like apples',
    'apple is my favorite fruit'
    ]

# Learn the vocabulary and IDF weights, then display the dense TF-IDF matrix.
tfidf_vectorizer.fit(sample_sentences)
tfidf_vectorizer.transform(sample_sentences).toarray()

array([[0.        , 0.        , 0.        , 0.        , 1.        ,
        0.        ],
       [0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.70710678],
       [0.57735027, 0.        , 0.57735027, 0.57735027, 0.        ,
        0.        ]])
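To see where numbers like `0.70710678` come from, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three short documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus, used only for this illustration.
docs = ['like apples', 'apple favorite fruit', 'apple like']

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed IDF (scikit-learn default)

weighted = counts * idf                        # term frequency times IDF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

A document with two terms that share the same IDF ends up with both entries equal to `1 / sqrt(2)`, which is the `0.70710678` seen in the second row of the output above.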

#### Quick Exercise!
Complete the code below to vectorize the full training and test sets.

In [None]:
# Finish the code below using the example from above.
# Fit the vocabulary and IDF weights on the training data only, then reuse them for the test data.
train_tfidf = tfidf_vectorizer.fit_transform(train.data)
test_tfidf = tfidf_vectorizer.transform(test.data)
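The difference between `fit_transform` and `transform` matters: the vocabulary is learned from the training documents only, so words that appear only in the test documents are dropped. The following sketch illustrates this with made-up sentences and a fresh vectorizer named `vec` (both are assumptions for the example).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
train_docs = ['the doctor prescribed vitamins', 'the monitor renders graphics']
test_docs = ['the doctor bought a hologram']   # 'bought' and 'hologram' are unseen

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)         # reuses them; unseen words are ignored

print(train_matrix.shape, test_matrix.shape)   # same number of columns
print('hologram' in vec.vocabulary_)
```

Both matrices share the same columns, which is what lets a classifier trained on one be applied to the other.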

### Features

In [None]:
feature_names = tfidf_vectorizer.get_feature_names_out()
train_dense = train_tfidf.toarray()
df_tfidf = pd.DataFrame(train_dense, columns=feature_names)

feature_names

array(['aaron', 'ab', 'abandon', ..., 'zip', 'zoom', 'zooming'],
      dtype=object)

### Classifier
We will build and train our Naive Bayes model now.

In [None]:
naive_bayes = MultinomialNB()
naive_bayes.fit(train_tfidf, train.target)
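For intuition about what the fitted model does at prediction time, here is a small sketch on made-up count data. It assumes the usual Multinomial Naive Bayes decision rule: for each class, add the log prior to the feature values weighted by the log feature probabilities, then pick the class with the highest score. The names `X_toy`, `y_toy`, and `nb` are introduced here.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative feature matrix (e.g. word counts); values are made up.
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 4, 1],
                  [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(X_toy, y_toy)

# Score per class: log prior + sum over features of value * log P(feature | class).
joint_log_likelihood = X_toy @ nb.feature_log_prob_.T + nb.class_log_prior_
manual_pred = nb.classes_[np.argmax(joint_log_likelihood, axis=1)]

print(np.array_equal(manual_pred, nb.predict(X_toy)))
```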

### Prediction

In [None]:
prediction = naive_bayes.predict(test_tfidf)
accuracy = accuracy_score(test.target, prediction)
print(f'Accuracy: {accuracy}')

Accuracy: 0.788948069241012
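A single accuracy number hides which categories get confused with each other. As a sketch, using made-up labels rather than the real predictions, a confusion matrix breaks the result down per class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for four classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 3, 1, 1, 2, 2, 3, 0])

acc = accuracy_score(y_true, y_pred)     # fraction of matching labels
cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

print(acc)
print(cm)
```

In the notebook, the same call would be `confusion_matrix(test.target, prediction)`. The diagonal counts correct predictions, and off-diagonal entries show which pairs of classes are mixed up.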


### Custom Testing

In [None]:
text = [
    'i do believe in jesus',
    'Nvidia released new video card',
    'an apple a day keeps the doctor away',
    'god does not exist',
    'My monitor supports HDR',
    'Vitamins are essential for your health and development'
]

check = naive_bayes.predict(tfidf_vectorizer.transform(text))

for i in range(len(check)):
    print(f'"{target_categories[check[i]]:<22}" ==> "{text[i]}"')

"soc.religion.christian" ==> "i do believe in jesus"
"comp.graphics         " ==> "Nvidia released new video card"
"sci.med               " ==> "an apple a day keeps the doctor away"
"soc.religion.christian" ==> "god does not exist"
"comp.graphics         " ==> "My monitor supports HDR"
"sci.med               " ==> "Vitamins are essential for your health and development"


In [None]:
### Adult


### Downloading and loading the data


In [None]:
### summary

In [None]:
### Drop columns

x = x.drop(columns=your_columns)  # your_columns: list of column names to remove

In [None]:
### Processing
### Fill missing values
x.isna().sum()  # number of missing values per column

# how?

has_nan_cols = x.columns[x.isna().any()]  # columns that contain at least one NaN
for col in has_nan_cols:
    # your code here
    # hint: average
    pass
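One possible way to follow the hint is sketched below on a tiny made-up table; the column names and values are hypothetical, not taken from the real dataset. Numeric columns are filled with their average. Filling text columns with the most frequent value is an added assumption, since an average is not defined for them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; column names and values are illustrative only.
x_demo = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'hours_per_week': [40.0, 50.0, np.nan, 30.0],
    'workclass': ['Private', None, 'Private', 'State-gov'],
})

has_nan_cols = x_demo.columns[x_demo.isna().any()]
for col in has_nan_cols:
    if pd.api.types.is_numeric_dtype(x_demo[col]):
        # Numeric column: replace missing values with the column average.
        x_demo[col] = x_demo[col].fillna(x_demo[col].mean())
    else:
        # Non-numeric column: replace missing values with the most frequent value.
        x_demo[col] = x_demo[col].fillna(x_demo[col].mode()[0])

print(x_demo)
```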

In [None]:
# train val test split, cross_validation
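A sketch of both options, using synthetic data in place of the real table (an assumption): two successive calls to `train_test_split` give a 60/20/20 train/validation/test split, and `cross_val_score` runs 5-fold cross-validation on the non-test portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real features and labels.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

# First hold out 20% as the test set, then split the rest 75/25 into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

# Alternative to a fixed validation set: 5-fold cross-validation on train+validation.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_trainval, y_trainval, cv=5)

print(len(X_train), len(X_val), len(X_test))
print(cv_scores.mean())
```

The test set is never touched during model selection; it is kept for the final evaluation only.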

In [None]:
### build model and validation, regularization

In [None]:
# hyperparameter

alphas = []  # fill in the candidate values to try
for alpha in alphas:
    pass  # your code here
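As one way such a loop could look, the sketch below tries several regularization strengths and keeps the one with the best cross-validated accuracy. The choice of `RidgeClassifier`, the grid of `alpha` values, and the synthetic data are all assumptions made for illustration; the exercise does not specify a model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:    # hypothetical grid of strengths
    model = RidgeClassifier(alpha=alpha)       # larger alpha means stronger regularization
    results[alpha] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

best_alpha = max(results, key=results.get)     # alpha with the highest mean CV accuracy
print(results)
print(best_alpha)
```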





In [None]:
### feature importance
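For a linear model, a common first look at feature importance is the size of the learned coefficients, provided the features are on a comparable scale. The sketch below assumes a logistic regression on standardized synthetic data; the feature names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data, with hypothetical feature names.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0)
names = [f'feature_{i}' for i in range(X_syn.shape[1])]

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_syn)
clf = LogisticRegression().fit(X_scaled, y_syn)

importance = np.abs(clf.coef_.ravel())         # magnitude of each weight
order = np.argsort(importance)[::-1]           # most influential first
for idx in order:
    print(f'{names[idx]}: {clf.coef_[0, idx]:+.3f}')
```

The sign of a coefficient shows the direction of the effect on the positive class, and its magnitude shows how strongly the feature moves the score.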