<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Useful_tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Goal
The goal of this notebook is to collect the libraries and functions that you may need for your final project.

##Info about libraries

Use `import` to import a package or a function:

*  Import a library and assign an alias: `import pandas as pd`. This imports the pandas library and gives it the alias `pd`.

*  Import a class from a library so it can be used without the package prefix: `from sklearn.preprocessing import OneHotEncoder`. You can then write `OneHotEncoder` instead of `sklearn.preprocessing.OneHotEncoder`.
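
As a small illustration of the two import forms, the sketch below imports numpy with an alias and also imports `asarray` directly. The variable names (`values`, `with_alias`, `without_prefix`) are made up for this example.

```python
import numpy as np           # alias form: call functions as np.<name>
from numpy import asarray    # direct form: call asarray without a prefix

values = [1, 2, 3]
with_alias = np.asarray(values)
without_prefix = asarray(values)

# Both names point to the same function object
same_function = np.asarray is asarray
print(same_function)
print(with_alias, without_prefix)
```

Either form works; the alias form makes it easier to see which library a function comes from.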


###Libraries for final project
These imports set up the necessary tools for data manipulation, visualization, model training, and evaluation.

`from pandas import read_csv` Imports the read_csv function from the pandas library, which is used to read data from a CSV file into a DataFrame.

`from numpy import asarray` Imports the asarray function from the numpy library, which is used to convert input data into an array.

`from matplotlib import pyplot as plt`  Imports the pyplot module from the matplotlib library, which is used for creating plots and visualizations.

`from sklearn.model_selection import train_test_split` Imports the train_test_split function from scikit-learn, which is used to split data into training and testing sets.

`from sklearn.metrics import classification_report` Imports the classification_report function, which builds a text summary of precision, recall, and F1-score for each class.

`from sklearn.metrics import confusion_matrix` Imports the confusion_matrix function, which counts how many samples of each true class were predicted as each class.

`from sklearn.metrics import accuracy_score` Imports the accuracy_score function, which computes the fraction of predictions that are correct.

`from sklearn.neighbors import KNeighborsClassifier` Imports the KNeighborsClassifier class, which is used to create a k-nearest neighbors classifier.

`from sklearn.svm import SVC` Imports the SVC class, which is used to create a support vector machine classifier.

##Functions to upload files
###Upload csv file using pandas

###Example 1
```
from pandas import read_csv
#store the file location in a variable
file_location = "filepath and name"

#if the file has no header row, pass the column names with the names= parameter
colnames = ['name1','name2', ....,'name3']
dataset = read_csv(file_location, names=colnames)
```
###Example 2


```
from pandas import read_csv
#download the file first (shell command, works in Colab/Jupyter)
!wget -O filename.csv "url"
#then read it with pandas
dataset = read_csv('filename.csv')
#if your dataset has no header row, add the parameter header=None when you read the file
dataset = read_csv('filename.csv', header=None)
#or supply the column names with the parameter names=['name1','name2', ....,'name3']
```
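
The sketch below shows the difference between `header=None` and `names=` on a tiny, made-up CSV. To stay self-contained it reads from an in-memory string through `StringIO` instead of a real file; the CSV content and column names are assumptions for illustration only.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical CSV content with no header row
csv_text = "5.1,3.5,setosa\n6.2,2.9,versicolor\n"

# header=None keeps the first line as data; columns are numbered 0, 1, 2
no_header = read_csv(StringIO(csv_text), header=None)

# names= gives the columns readable names when the file has no header row
colnames = ["sepal_length", "sepal_width", "species"]
named = read_csv(StringIO(csv_text), names=colnames)

print(no_header.shape)
print(list(named.columns))
```

With a real file, replace `StringIO(csv_text)` with the file path.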





##Creating training and test datasets

1. First, split the features and the labels (target) into X and y.

This section uses train_test_split (import it if you have not already).
Start from an array that holds your dataset values.



```
#let's say you assign your preprocessed dataset values to array

array = <dataset values>

#assign the feature columns to X and the label column to y

X = array[<range for feature data>]

y = array[<range for labels>]
```

2. Now that you have the features and labels (target), you can split them into training and test sets:

`X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=<test size>, random_state=<any number>)`
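
The two steps above can be sketched on a small, made-up array. The data values, the 25% test size, and `random_state=7` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 8 rows, 2 feature columns, label in the last column
array = np.array([
    [1.0, 2.0, 0],
    [1.5, 1.8, 0],
    [5.0, 8.0, 1],
    [6.0, 9.0, 1],
    [1.2, 0.9, 0],
    [7.0, 8.5, 1],
    [0.8, 1.1, 0],
    [6.5, 7.5, 1],
])

# Step 1: features go to X, labels go to y
X = array[:, 0:2]
y = array[:, 2]

# Step 2: hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

print(X_train.shape, X_test.shape)
```

Using the same `random_state` value each time gives the same split, which makes results easier to compare between runs.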


##Run models
SVM and K-nearest neighbors both come from sklearn.
The skeleton is the same, but the parameters differ.

**For SVM**

Set the `C` and `kernel` parameters
```
from sklearn.svm import SVC
#random_state makes results reproducible; in SVC it only has an effect when probability=True
svm = SVC(kernel="linear", C=<value>, random_state=<number>)
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
#score returns the mean accuracy on the given test data and labels
svm.score(X_test, y_test)
```
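
For a version that runs as written, the sketch below fills in the placeholders using the iris dataset bundled with scikit-learn. The dataset, `C=1.0`, the 20% test size, and `random_state=1` are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Example data: 150 iris flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Linear-kernel SVM with an illustrative value of C
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)

# Predicted class for each test row, and mean accuracy on the test set
predictions = svm.predict(X_test)
accuracy = svm.score(X_test, y_test)
print(len(predictions), accuracy)
```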

**For K-nearest neighbors**

Set n_neighbors (k)
```
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=<number>)
KNN.fit(X_train, y_train)
predictions = KNN.predict(X_test)
KNN.score(X_test, y_test)
```
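
Since k is the main parameter to set, one way to choose it is to try a few values and compare test accuracy. The sketch below does this on the iris dataset; the dataset, the candidate values (1, 5, 15), and the split settings are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit one classifier per candidate k and record its test accuracy
scores = {}
for k in (1, 5, 15):
    KNN = KNeighborsClassifier(n_neighbors=k)
    KNN.fit(X_train, y_train)
    scores[k] = KNN.score(X_test, y_test)

for k, score in scores.items():
    print(k, score)
```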







##Performance metrics

For this part, functions from sklearn.metrics are used.

To get the confusion matrix


```
print(confusion_matrix(<test labels>, <prediction>))
```

To get a metrics summary (precision, recall, F-score), use classification_report



```
print(classification_report(<test labels>, <prediction>))
```
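
To see what these functions return, the sketch below applies them to short, hand-written label lists. The labels are made up; in a project they would be `y_test` and the model's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions for six samples
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Fraction of predictions that match the true labels (4 of 6 here)
acc = accuracy_score(y_true, y_pred)
print(acc)

# Text table with precision, recall and F1-score per class
report = classification_report(y_true, y_pred)
print(report)
```

In both functions the true labels come first and the predictions second; swapping them transposes the confusion matrix.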






##General functions to manipulate data



.head(): Returns the first n rows of a DataFrame. By default, it returns the first 5 rows.

.shape: Returns a tuple representing the dimensionality of the DataFrame (number of rows, number of columns). It is an attribute, so use it without parentheses.

.describe(): Generates descriptive statistics of the DataFrame, such as count, mean, std, min, and max for numerical columns.

.dtypes: Returns the data type of each column in the DataFrame. It is an attribute, so use it without parentheses.

.groupby(): Groups the DataFrame using a mapper or by a series of columns. It is often used for aggregation.

.duplicated(): Returns a Boolean Series indicating whether each row is a duplicate or not.

.dropna(): Removes missing values from the DataFrame. By default, it drops any row with at least one missing value.
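
The functions above can be tried on a small, made-up DataFrame. The column names and values below are assumptions for illustration; the table contains one duplicated row and one missing value on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicate row and one missing value
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length": [1.0, 1.0, 2.5, np.nan, 3.0],
})

print(df.head(2))        # first 2 rows
print(df.shape)          # attribute, no parentheses
print(df.dtypes)         # attribute, no parentheses
print(df.describe())     # summary statistics for numeric columns

# Mean length per species (missing values are skipped)
mean_by_species = df.groupby("species")["length"].mean()
print(mean_by_species)

# Count rows that repeat an earlier row
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Drop every row that has at least one missing value
cleaned = df.dropna()
print(cleaned.shape)
```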

