# Session 1: Preparation and exploration of the dataset

1. Download the dataset from the GitHub repository
2. Explore the dataset
3. Application of filtering techniques
4. Ways of cleaning the data
5. Show how to use various Python libraries such as pandas for working with the data
6. Hands-on exercise

### 1.1 Download dataset


In [None]:
!curl -L -o data.tsv https://raw.githubusercontent.com/auwalsoe/encode_nlp_workshop_2023/main/data/papyrus_data.tsv
!pip install gradio

### 1.2 Data exploration

In [None]:
import pandas as pd

# The curl command above saves data.tsv to the current working directory
# (/content on Google Colab), so a relative path works in both settings.
data = pd.read_csv('data.tsv', sep='\t')

In [None]:
data.head()
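
Beyond `data.head()`, a few more calls give a quick picture of the size of the table, the missing values, and how the labels are distributed. The following is a minimal sketch on a made-up toy table; the rows are invented for illustration and only the column names `translation` and `provenance` are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the papyrus table (invented rows, same column names).
toy = pd.DataFrame({
    "translation": ["text one", None, "text three", "text four"],
    "provenance": ["A", "B", "A", "unknown"],
})

print(toy.shape)            # (number of rows, number of columns)
print(toy.isna().sum())     # missing values per column

# How many rows each provenance label has.
label_counts = toy["provenance"].value_counts()
print(label_counts)
```

Running the same three calls on `data` shows whether some provenances are too rare to learn from, which motivates the filtering in the next section.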

### 1.3 Application of filtering techniques

In [None]:
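# Drop rows whose provenance is not recorded.
data = data[data.provenance != 'unknown']
# Keep only provenances that occur at least 100 times, so every class has enough examples.
data = data[data.groupby('provenance')['provenance'].transform('count').ge(100)]

# Model inputs (translated texts) and labels (provenances) as NumPy arrays.
translations = data['translation'].values
provenance = data['provenance'].values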



In [None]:
translations[:10]
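
The frequency filter in this section relies on `groupby(...).transform('count')`, which returns one value per row: the size of the group that row belongs to. The sketch below shows this on an invented toy table, using a threshold of 2 instead of 100 so the effect is visible; `min_count` is a name introduced here for illustration.

```python
import pandas as pd

# Invented toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "provenance": ["A", "A", "B", "unknown", "A", "B", "C"],
    "translation": ["t1", "t2", "t3", "t4", "t5", "t6", "t7"],
})

# Step 1: drop rows without a usable label.
toy = toy[toy.provenance != "unknown"]

# Step 2: for every remaining row, count how many rows share its label.
counts = toy.groupby("provenance")["provenance"].transform("count")
print(counts.tolist())  # [3, 3, 2, 3, 2, 1]

# Step 3: keep only rows whose label is frequent enough.
min_count = 2  # the notebook uses 100
filtered = toy[counts.ge(min_count)]
print(filtered["provenance"].value_counts().to_dict())
```

Label `C` appears only once, so its row is removed, while all rows for `A` and `B` are kept.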

### 1.4 Cleaning the data
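
As a minimal sketch of what cleaning could involve, the example below drops rows with missing values, normalizes whitespace, removes translations that are empty after trimming, and drops exact duplicates. These particular steps are an assumption about what the dataset needs, the helper name `clean_translations` is hypothetical, and the toy rows are invented.

```python
import pandas as pd

def clean_translations(df):
    """Return a cleaned copy of df (assumes 'translation' and 'provenance' columns)."""
    # Drop rows where either the text or the label is missing.
    out = df.dropna(subset=["translation", "provenance"]).copy()
    # Collapse runs of whitespace into one space and trim both ends.
    out["translation"] = (
        out["translation"]
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # Drop rows whose translation is empty after trimming.
    out = out[out["translation"] != ""]
    # Remove exact duplicates of the same text with the same label.
    out = out.drop_duplicates(subset=["translation", "provenance"])
    return out.reset_index(drop=True)

# Invented toy rows for illustration.
toy = pd.DataFrame({
    "translation": ["  To  Zenon, greetings. ", None, "To Zenon, greetings.", "   "],
    "provenance": ["A", "A", "A", "B"],
})
cleaned = clean_translations(toy)
print(cleaned)
```

Only one row survives here: the missing and whitespace-only translations are dropped, and the two greetings become identical once whitespace is normalized.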

### 1.5 Show how to use various libraries such as pandas for working with the text data

### 1.6 Hands-on exercise: Cleaning and filtering of the dataset

# Session 2: Introduction to NLP techniques
1. Tokenization
2. Stopword removal
3. Vectorization
4. Stemming
5. (Optional) Lemmatization
6. Show how these techniques can be applied on papyrus data using NLTK
7. Hands-on exercise: Apply NLP techniques to papyrus dataset

### 2.1 Tokenization
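
Tokenization splits a text into smaller units, usually words. The sketch below uses only the standard library and a deliberately simple rule (lowercase, then keep runs of the letters a to z), so it discards digits and punctuation; the function name `tokenize` and the example sentence are made up for illustration. NLTK's `word_tokenize` handles more cases but needs a tokenizer resource downloaded first.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "Apollonios to Zenon, greetings. Send the wheat quickly!"
tokens = tokenize(sentence)
print(tokens)
# ['apollonios', 'to', 'zenon', 'greetings', 'send', 'the', 'wheat', 'quickly']
```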

### 2.2 Stopword removal
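
Stopwords are very frequent words such as "the" or "to" that carry little information about a text's topic, so they are often removed before modelling. The sketch below filters a token list against a small hand-picked set; both the set `STOPWORDS` and the helper `remove_stopwords` are illustrative assumptions. A fuller English list is available as `nltk.corpus.stopwords.words('english')` once the stopwords corpus has been downloaded.

```python
# Small hand-picked stopword set, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, in original order."""
    return [t for t in tokens if t not in stopwords]

tokens = ["apollonios", "to", "zenon", "greetings", "send", "the", "wheat", "quickly"]
content_tokens = remove_stopwords(tokens)
print(content_tokens)
# ['apollonios', 'zenon', 'greetings', 'send', 'wheat', 'quickly']
```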

### 2.3 Vectorization

### 2.4 Stemming
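
Stemming strips word endings with fixed rules so that related forms share one stem, which need not be a real word. The sketch below uses NLTK's `PorterStemmer`, which works without any extra data download; the word list is made up for illustration, and the Porter algorithm is one common choice among several stemmers.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Invented example words with different endings.
words = ["running", "letters", "payments", "received"]
stems = [stemmer.stem(w) for w in words]

# Show each word next to its stem.
for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

For example, "running" is reduced to "run" and "letters" to "letter", so both forms of each word end up as the same feature after vectorization.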

### 2.5 Lemmatization (optional)

### 2.6 Application of the above-mentioned techniques using NLTK

### 2.7 Hands-on exercise: Apply NLP techniques to papyrus dataset

# Session 3: Introduction to machine learning techniques and building a text classification model
1. Introduction to machine learning
2. Supervised learning and unsupervised learning
3. Building a text classification model
    1. Choose what to predict and which variables to use
    2. Split data into training and test
    3. Transform/vectorize data
    4. Train a logistic regression model
    5. Test and evaluate metrics
    6. Deploy model with Gradio
4. Hands-on exercise: Build and train your own classification model
5. Wrap-up

### 3.1 Introduction to machine learning

### 3.2 Supervised learning and unsupervised learning

### 3.3 Building a text classification model

#### 3.3.1 Choose what to predict and which variables to use

#### 3.3.2 Split data into training and test

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(translations, provenance, test_size=0.33, random_state=42)

#### 3.3.3 Transform/vectorize data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
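tfidf_vect = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts only,
# then reuse them for the test texts so no test information leaks into training.
X_train = tfidf_vect.fit_transform(X_train)
X_test = tfidf_vect.transform(X_test)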

#### 3.3.4 Train a logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

#### 3.3.5 Test and evaluate metrics

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
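
The classification report summarizes precision, recall, and F1 per class. A confusion matrix complements it by showing which classes get mistaken for which. The sketch below uses invented toy labels rather than the model's real predictions; in the notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels for five texts.
y_true = ["A", "A", "B", "B", "C"]
y_hat = ["A", "B", "B", "B", "A"]
labels = ["A", "B", "C"]

# Rows are true labels, columns are predicted labels, both in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Share of texts whose predicted label matches the true label (3 of 5 here).
acc = accuracy_score(y_true, y_hat)
print(acc)
```

Off-diagonal entries are the errors: here one `A` text was predicted as `B`, and the single `C` text was predicted as `A`.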

#### 3.3.6 Deploy model with Gradio

In [None]:
import gradio as gr

def find_my_provenance(text):
    # Reuse the fitted vectorizer so the features match the training data.
    text_tfidf = tfidf_vect.transform([text])
    return str(model.predict(text_tfidf)[0])

demo = gr.Interface(
    fn=find_my_provenance,
    inputs=gr.Textbox(lines=3, placeholder="Papyrus translation here"),
    outputs="text",
)
demo.launch(share=True)  # share=True also creates a temporary public link

### 3.4 Hands-on exercise: Build and train your own classification model

### 3.5 Wrap-up