# machine learning basics cheat sheet


## linear regression

Steps

A. Exploratory data analysis
sns.pairplot(USAhousing) # use seaborn to look for patterns between pairs of columns
sns.heatmap(USAhousing.corr()) # correlation between the features in X and the target y

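B. Training the model

1. Train/test data split
X = USAhousing[['col_1', 'col_2', ...]] # placeholder names; double brackets select a list of columns
y = USAhousing['target_variable'] # placeholder name for the target column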

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101) # test_size is usually 0.3 to 0.4

C. Creating and training the model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df lists the fitted coefficient of each input feature. These are not correlations: each coefficient is the change in the predicted target for a 1 unit increase in that feature, with all other features held fixed.
For example:
- Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an **increase of $21.52**.
- Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an **increase of $164883.28**.

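To make the reading of the coefficients concrete, here is a minimal sketch on synthetic data. The column names `x1` and `x2`, the weights 3 and -2, and the intercept 5 are all made up for illustration. Because the target is built exactly from those weights, the fitted coefficients should recover them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): true weights are 3 and -2, intercept is 5
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y_demo = 3 * X_demo["x1"] - 2 * X_demo["x2"] + 5

lm_demo = LinearRegression()
lm_demo.fit(X_demo, y_demo)

# One row per feature: change in the prediction per 1 unit increase,
# with the other feature held fixed
coeff_demo = pd.DataFrame(lm_demo.coef_, X_demo.columns, columns=["Coefficient"])
print(coeff_demo)
print("intercept:", lm_demo.intercept_)
```

The printed coefficients should sit very close to 3 and -2, which is exactly the "1 unit increase" reading used above.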
D. Testing the predictions against the test data
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions) # or any other graph, such as a histogram of the residuals
sns.histplot((y_test-predictions),bins=50,kde=True) # replaces sns.distplot, which is deprecated in recent seaborn versions

Important points regarding error analysis

MAE is the easiest to understand, because it's the average absolute error.
MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
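The three metrics can be computed with `sklearn.metrics`. The sketch below uses four made-up target values and predictions (not the housing data) so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration; the errors are 1, 0, -2, -1
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_hat)  # (1 + 0 + 2 + 1) / 4
mse = metrics.mean_squared_error(y_true, y_hat)   # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                               # back in the units of y

print("MAE:", mae)    # 1.0
print("MSE:", mse)    # 1.5
print("RMSE:", rmse)  # about 1.2247
```

Note how the single error of size 2 contributes half of the MAE but two thirds of the MSE, which is the "punishing" effect described above.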

## logistic regression (used for yes and no questions)

Steps

A. Exploratory data analysis
sns.pairplot(train) # use seaborn to look for patterns

We explore which values are missing using a heatmap of isnull():
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

If a column is missing most of its values, drop the column.
If only some values are missing (around 20%, say), fill them in with the column's average, or with the average within each subtype defined by another column. The subtype averages can be read off box plots or computed directly.
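A minimal sketch of filling gaps with the subtype average, assuming a hypothetical frame where `Pclass` is the subtype column and `Age` has missing values (both column names and all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Age' has gaps, 'Pclass' defines the subtypes
df_demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 36.0, 30.0, np.nan, 22.0, 26.0, np.nan],
})

# Fraction of missing values, to decide between dropping and filling
missing_fraction = df_demo["Age"].isnull().mean()
print(missing_fraction)  # 0.375

# Mean age within each Pclass, aligned row by row with the frame
group_mean = df_demo.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing entries with the mean of their own subtype
df_demo["Age"] = df_demo["Age"].fillna(group_mean)
print(df_demo)
```

Using `df_demo["Age"].mean()` instead of `group_mean` would give the simpler whole-column average.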

We'll need to convert categorical features (like sex, or any other category) to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

We use pandas for this:
sex = pd.get_dummies(train['Sex'],drop_first=True) # drop_first avoids two columns that perfectly predict each other

Then drop all the columns that have no numerical meaning (free-text columns, for example), so that X contains only numerical data for the model.
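The dummy-variable and column-dropping steps can be sketched on a tiny made-up frame. The `Name` and `Fare` columns and all values here are assumptions for illustration; only `Sex` and `Survived` appear in the notes above.

```python
import pandas as pd

# Tiny made-up frame standing in for the training data
train_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],          # free text, no numerical meaning
    "Sex": ["male", "female", "male"],
    "Fare": [7.25, 71.28, 8.05],
    "Survived": [0, 1, 0],
})

# Two categories give two dummy columns; drop_first keeps only 'male'
sex = pd.get_dummies(train_demo["Sex"], drop_first=True)

# Replace the text columns with the dummy column
train_demo = pd.concat([train_demo.drop(["Sex", "Name"], axis=1), sex], axis=1)
print(train_demo.columns.tolist())  # ['Fare', 'Survived', 'male']
```

After this step every remaining column is numeric (or boolean), so the frame can be passed to train_test_split and the model.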

X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),train['Survived'], test_size=0.30,random_state=101)

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))






## knn (used for assigning a data point to one of several classes)

Standardize the Variables
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
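A small sketch of why scale matters, using three made-up points with a hypothetical income feature (large scale) and age feature (small scale). Before scaling, a 35-year age gap barely registers next to the income differences; after scaling, both features contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: [income, age]
X_demo = np.array([[30000.0, 25.0],
                   [30500.0, 60.0],   # similar income, very different age
                   [90000.0, 26.0]])  # very different income, similar age

def dist(a, b):
    """Euclidean distance, the default metric used by KNN."""
    return np.linalg.norm(a - b)

# Unscaled: the income column dominates both distances
d_raw_01 = dist(X_demo[0], X_demo[1])
d_raw_02 = dist(X_demo[0], X_demo[2])
print(d_raw_01, d_raw_02)

# Standardized: each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)
d_scaled_01 = dist(X_scaled[0], X_scaled[1])
d_scaled_02 = dist(X_scaled[0], X_scaled[2])
print(d_scaled_01, d_scaled_02)
```

On the raw data the second distance is more than a hundred times the first; after standardizing, the two distances are of similar size.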

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])# without target class

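Split the data into training and test sets with train_test_split as before, using df_feat as X and df['TARGET CLASS'] as y.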

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)# we can change n_neighbors
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,pred))# here we check the performance of the model

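Now we optimize the value of n_neighbors: for each candidate K we fit the model and record the error rate, i.e. the fraction of test points where the prediction is not equal to the true label. The plot below assumes these values are stored in error_rate for K = 1 to 39.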
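The loop that fills `error_rate` is not shown in these notes. A minimal sketch of it follows, using synthetic data from `make_classification` as a stand-in for the real dataset (the data, sizes and seeds are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features and target (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)
X_demo = StandardScaler().fit_transform(X_demo)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101)

error_rate = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    # Fraction of test points where the prediction differs from the label
    error_rate.append(np.mean(pred_k != y_test))

print(error_rate[:5])
```

`error_rate` ends up with one entry per K in `range(1,40)`, which matches the x-axis of the plotting code.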

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

K versus error rate is graphed above, and we may conclude that for n_neighbors > 23 the error rate stays almost the same.


## decision tree and random forest (again used for yes and no decisions)

In order to pick which feature to split on, we need a way of measuring how good the split is. This is where information gain and entropy come in.

We would like to choose questions that give a lot of information about the tree’s prediction. 
For example, if there is a single yes/no question that accurately predicts the outputs 99% of the time, then that question allows us to “gain” a lot of information about our data. In order to measure how much information we gain, we introduce entropy.

The entropy is a measure of uncertainty associated with our data.

We can intuitively think that if a data set had only one label (e.g. every passenger survived), then we have a low entropy. So we would like to split our data in a way that minimizes the entropy. The better the splits, the better our prediction will be.
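Entropy and information gain can be sketched in a few lines of NumPy. The function names and the toy labels below are illustrative, and the sketch uses Shannon entropy in bits; a tree library may use a different impurity measure by default.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # half survived, half did not
print(entropy(parent))                        # 1.0, maximum uncertainty for two classes
print(entropy(np.array([1, 1, 1, 1])))        # 0.0, only one label

# A split that separates the labels perfectly gains the full 1 bit
good_gain = information_gain(parent, parent[:4], parent[4:])
# A split that leaves both halves mixed 50/50 gains nothing
bad_gain = information_gain(parent, np.array([1, 1, 0, 0]), np.array([1, 1, 0, 0]))
print(good_gain, bad_gain)                    # 1.0 0.0
```

The tree picks, at each node, the question whose split has the largest gain.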

For the pros, Decision Trees are easily interpretable and can handle missing values and outliers. They can also handle discrete and continuous data types, along with irrelevant features.

For the cons, Decision Trees overfit very easily, and while they are computationally cheap for prediction, training the decision tree can be computationally expensive.

The idea behind a **Random Forest** is actually pretty simple: we repeatedly select data from the data set (with replacement) and build a Decision Tree with each new sample. The forest's prediction is then the majority vote of its trees.
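The sampling idea can be sketched by hand with plain decision trees. This is only an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally. The number of trees (25), the seeds and `max_features="sqrt"` (a random subset of features at each split, which these notes do not describe) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so a 0/1 vote cannot tie
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Majority vote over the trees for each test row (labels are 0/1)
votes = np.array([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = np.mean(forest_pred == y_te)
print("accuracy:", accuracy)
```

Each tree sees a slightly different data set, so their individual overfitting tends to average out in the vote.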


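from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
predictions = dtree.predict(X_test) # predict with the tree before scoring it
print(classification_report(y_test,predictions))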

**We can even print out the decision tree using the pydot library.**

Similar code for a random forest:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)# 100 random trees used
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test) # predictions of the forest on the test set
print(classification_report(y_test,rfc_pred))# slightly better performance of random forest


## support vector machines

These are mainly used for classification and regression.
An SVM fits a hyperplane in the n-dimensional feature space that separates the classes with the widest possible margin, and penalizes the points that fall on the wrong side of it.
It is often used when there are many features, where it can perform much better than decision trees.

from sklearn.svm import SVC
model = SVC()
model.fit(X_train,y_train)

**confusion matrix**

TP, FP
FN, TN

If the model predicts 0 samples for one of the categories, we have to tune the parameters with a grid search.
This may happen because the data is highly biased towards one class.
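When reading the matrix from scikit-learn, note that `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, with labels in sorted order. For 0/1 labels the array therefore reads `[[TN, FP], [FN, TP]]`, which is a different arrangement from the TP/FP/FN/TN layout written above. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 2]]

# For binary 0/1 labels, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 2
```

A column of zeros in this matrix is the "0 predictions in one category" situation described above.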


Finding the right parameters (like what C or gamma values to use) is a tricky task! But luckily, we can be a little lazy and just try a bunch of combinations and see what works best! This idea of creating a 'grid' of parameters and just trying out all the possible combinations is called a grid search. The method is common enough that Scikit-learn has this functionality built in with GridSearchCV! The CV stands for cross-validation.

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.

**It basically tries out the hyperparameter values of a model and returns the combination of values that scores best.**

param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 

Here we vary the parameters C and gamma.

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)

One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator like SVC and creates a new estimator that behaves exactly the same, in this case like a classifier. Set refit=True (this is also the default) and set verbose to whatever number you want: the higher the number, the more verbose (verbose just means the text output describing the process).

What fit does is a bit more involved than usual. First, it runs the same loop with cross-validation to find the best parameter combination. Once it has the best combination, it runs fit again on all the data passed to fit (without cross-validation) to build a single new model using the best parameter setting.
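The whole search can be sketched end to end on synthetic data. The data set, the smaller grid and `cv=5` below are assumptions chosen to keep the example fast; they are not the settings used in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# 3 values of C x 3 values of gamma x 1 kernel = 9 combinations
param_grid_demo = {"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01], "kernel": ["rbf"]}

grid_demo = GridSearchCV(SVC(), param_grid_demo, refit=True, cv=5)
grid_demo.fit(X_tr, y_tr)  # cross-validates every combination, then refits the best one

print(grid_demo.best_params_)         # best combination found
print(grid_demo.best_score_)          # its mean cross-validated accuracy
print(grid_demo.score(X_te, y_te))    # accuracy of the refitted model on held-out data
```

Because of the refit step, `grid_demo` can be used directly for `predict` and `score`, just like a fitted SVC.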

grid.fit(X_train,y_train) # run the search; best_params_ and best_estimator_ only exist after this

grid.best_params_ # this gives the best parameter values

grid.best_estimator_ # this is now the best SVC for the data, and it can be used like any other SVC

grid_predictions = grid.predict(X_test)

print(classification_report(y_test,grid_predictions)) # we get a much better result using this.






## k means clustering (unsupervised learning, no target to predict)

K-means is an unsupervised learning method for clustering data points. The algorithm iteratively divides data points into K clusters by minimizing the variance in each cluster.

First, each data point is randomly assigned to one of the K clusters. Then, we compute the centroid (functionally the center) of each cluster, and reassign each data point to the cluster with the closest centroid. We repeat this process until the cluster assignments for each data point are no longer changing.
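The steps above can be sketched directly in NumPy. This is an illustration of the procedure as described, not scikit-learn's `KMeans` implementation. The function name, the handling of empty clusters and the two made-up blobs are assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain k-means: random assignment, then alternate centroid and label updates."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(points))  # random initial assignment
    for _ in range(max_iter):
        centroids = np.empty((k, points.shape[1]))
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
            else:
                # Assumption: re-seed an empty cluster at a random data point
                centroids[j] = points[rng.integers(len(points))]
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids

# Two made-up blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans_sketch(points, k=2)
print(centroids)
```

The result depends on the random initial assignment, which is one reason library implementations rerun the algorithm from several starting points.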

K-means clustering requires us to select K, the number of clusters we want to group the data into. The elbow method lets us graph the inertia (a distance-based metric) against K and visualize the point at which it starts decreasing linearly. This point is referred to as the "elbow" and is a good estimate for the best value of K based on our data.

A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and, based on this information, decide which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers based on their income, basically create clusters.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(data[0])

Comparison of the K-means labels with the actual labels:

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(10,6))
ax1.set_title('K Means')
ax1.scatter(data[0][:,0],data[0][:,1],c=kmeans.labels_,cmap='rainbow')
ax2.set_title("Original")
ax2.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='rainbow')

The k value in k-means clustering is a crucial parameter that determines the number of clusters to be formed in the dataset. Finding the optimal k value in the k-means clustering can be very challenging, especially for noisy data. The appropriate value of k depends on the data structure and the problem being solved. It is important to choose the right value of k, as a small value can result in under-clustered data, and a large value can cause over-clustering.

To optimize the value of k we use the elbow method:

data = list(zip(x, y))
inertias = []

for i in range(1,11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1,11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()


## natural language processing


