<a href="https://colab.research.google.com/github/elhamod/IS813/blob/main/Week1/IS813_2024_Week1_pre_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS813 Week 1: Basic ML and Language Modeling

1. Use Google Colab for this assignment.

2. **You are allowed to use ChatGPT for this assignment. However, as per the syllabus, you are required to cite your usage and submit the prompts and responses used as a PDF file. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. Enter your BUID (only numerical part) below.

5. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: Pre-class Work

## 1.1 Setup

In [None]:
BUID = 123456 #e.g., 123456 ONLY NUMERICAL PART

Machine learning is generally stochastic, meaning you can get different results on different runs. To avoid that, you can "seed" your code. This code uses your BU ID (only the numeric part) as a seed for all random number generators.
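To see what seeding does, here is a minimal sketch using only Python's built-in `random` module. The helper `toy_draw` and the seed value are illustrative assumptions, not part of the assignment: two generators created with the same seed produce the same sequence of numbers.

```python
import random

def toy_draw(seed, count=3):
    """Return `count` pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch the global state
    return [rng.randint(0, 99) for _ in range(count)]

# Same seed, same sequence: this is what makes a run reproducible.
first_run = toy_draw(123456)
second_run = toy_draw(123456)
print(first_run, second_run, first_run == second_run)
```

Each library keeps its own random state, which is why the setup cell seeds every library separately.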

In [None]:
import random
import numpy as np
import torch

# Set a seed for the built-in Python random module
random.seed(BUID)
# Set a seed for NumPy
np.random.seed(BUID)
# Set a seed for PyTorch
torch.manual_seed(BUID)

In this part, we will be working with medical data to profile patients with diabetes.

## 1.2 Loading the Diabetes Dataset



We will first download and take a look at the Kaggle diabetes dataset. The dataset contains patient profiles, and the `Outcome` column encodes whether each patient has diabetes.

In [None]:
!pip install kaggle
!kaggle datasets download -d mathchi/diabetes-data-set
!unzip /content/diabetes-data-set.zip

Dataset URL: https://www.kaggle.com/datasets/mathchi/diabetes-data-set
License(s): CC0-1.0
Downloading diabetes-data-set.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 16.6MB/s]
Archive:  /content/diabetes-data-set.zip
  inflating: diabetes.csv            


In [None]:
import pandas as pd
from IPython.display import display


# Load the dataset using the downloaded csv file
df = pd.read_csv('/content/diabetes.csv')

# Display the first rows of the dataset
display(df.head())

# Display the statistics of the table
display(df.describe())

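| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |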


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |




## 1.3 Using Decision Trees



Now, let's see if we can learn a useful model from this data. Given a patient's profile, we want to predict accurately whether they have diabetes. In this notebook, you will use a [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

Your task is to:

1. Split the data into training and test sets. You may refer to [this link](https://www.geeksforgeeks.org/pandas-create-test-and-train-samples-from-dataframe/) for help. Use 80% of the data for training, and the rest for testing. __(10 points)__

2. Train the tree on the training data. __(5 points)__

3. Report the accuracy on both the training and test sets. __(5 points)__

Let's start with a decision tree that has a depth of 5 (i.e., the tree asks a maximum of 5 questions before reaching a conclusion).
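As a rough illustration of the workflow described above (split, fit, score), here is a minimal sketch on synthetic data. The dataset from `make_classification` and all `toy_` names are assumptions made for illustration; this is not the diabetes data, and it is only one of several reasonable ways to do the split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 samples, 8 input features, binary label.
toy_X, toy_y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

# Fit a tree that may ask at most 5 questions along any root-to-leaf path.
toy_tree = DecisionTreeClassifier(max_depth=5, random_state=0)
toy_tree.fit(toy_X_train, toy_y_train)

toy_train_acc = accuracy_score(toy_y_train, toy_tree.predict(toy_X_train))
toy_test_acc = accuracy_score(toy_y_test, toy_tree.predict(toy_X_test))
print(f"train accuracy: {toy_train_acc:.3f}, test accuracy: {toy_test_acc:.3f}")
```

With a DataFrame, the same idea applies after separating the input columns from the outcome column.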


In [None]:
from sklearn.model_selection import train_test_split


### Split the dataframe into df_train and df_test.

### Split each of df_train and df_test by column into an input table and an outcome table.


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

depth = 5

## Create the decision tree

## fit the model to the training data


4. What is the accuracy of the model on both the training and test sets? __(5 points)__

In [None]:
## get accuracy on both training and test data.


Now, use the next cell to write code that creates 10 decision trees with different sizes (i.e., depths between 1 and 10, inclusive).

  5. Using a loop, train these different trees. Save their training and test accuracies. __(10 points)__

  6. Plot the training and test accuracies as two lines, with the x axis as the tree depth. Format your plot properly with legends and colors. __(5 points)__
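The pattern behind items 5 and 6 can be sketched on synthetic data as follows. The data and every `toy_` name are illustrative assumptions, and the numbers you get on the diabetes dataset will differ. The point is the shape of the loop: one model per depth, two scores saved per model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the diabetes dataset.
toy_X, toy_y = make_classification(n_samples=500, n_features=8, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

toy_train_scores, toy_test_scores = [], []
for toy_depth in range(1, 11):
    toy_model = DecisionTreeClassifier(max_depth=toy_depth, random_state=0)
    toy_model.fit(toy_X_train, toy_y_train)
    # For classifiers, .score() returns the mean accuracy on the given data.
    toy_train_scores.append(toy_model.score(toy_X_train, toy_y_train))
    toy_test_scores.append(toy_model.score(toy_X_test, toy_y_test))

for toy_depth, (tr, te) in enumerate(zip(toy_train_scores, toy_test_scores), start=1):
    print(f"depth={toy_depth:2d}  train={tr:.3f}  test={te:.3f}")
```

Training accuracy can only stay flat or improve as the tree is allowed to grow deeper; whether test accuracy follows is the interesting part of the plot.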

In [None]:
import matplotlib.pyplot as plt

train_accuracies = []
test_accuracies = []


## Train and save the accuracies of multiple trees in a loop.
range_of_depths = range(1,11)
for depth in range_of_depths:



# Plotting
plt.figure(figsize=(10, 6))
plt.plot(range_of_depths, train_accuracies, label='Training Accuracy', color='blue', marker='o')
plt.plot(range_of_depths, test_accuracies, label='Test Accuracy', color='red', marker='o')
plt.xlabel('Depth')
plt.ylabel('Accuracy')
plt.title('Training and Test Accuracies vs. Tree Depth')
plt.legend()
plt.grid(True)
plt.show()

# Part 2: In-class Work

## 2.1 Reflective Questions

1. From the figure in 1.3.6, can you identify which models are overfitting, underfitting, or fit well? Elaborate.

2. What does the depth reflect in machine learning lingo as discussed in class? Elaborate.

**Answers**


## 2.2 The Effect of Data Size

3. Let's go back to our best decision tree found in 1.3.6 but change one thing: What if we decide to only use 20% of the original data for training, and the rest for testing? How does that change the results? What is the explanation behind this change?

In [None]:
### Split df into df_train and df_test.

### Split df_train by column into the input columns and the outcome columns



In [None]:
# Train tree


# Get and print scores


**Answer**


## 2.5 Language Modeling with N-grams



Let's focus on language modeling now! In this section, you will create some n-grams and experiment with how they work.

### 2.5.1 Setup

Use `nltk` to create 2-grams (i.e., bigrams). Extract the bigrams in the sentence

> This is a sample sentence.
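To make the idea concrete before using `nltk`, here is a plain-Python sketch of what "extracting bigrams" means: pair every token with the token that follows it. The regular-expression tokenizer is a simplifying assumption standing in for a real tokenizer, and the `toy_` names are illustrative.

```python
import re

toy_sentence = "This is a sample sentence."

# Crude tokenizer (assumption): words and punctuation marks become separate tokens.
toy_tokens = re.findall(r"\w+|[^\w\s]", toy_sentence)

# A bigram is a pair of adjacent tokens.
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
print(toy_bigrams)
```

A sentence with 6 tokens yields 5 bigrams; in general, k tokens yield k - n + 1 n-grams when no padding is used.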

In [None]:
!pip install nltk
import nltk
# Ensure you have the tokenizers
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.util import ngrams

sentence = "This is a sample sentence."

n = 2

# Get the tokens of the sentence.


# Create the model


# Print the bigrams




### 2.5.1 Creating an N-gram Based on a Text Corpus.

Using [this functionality](https://www.nltk.org/api/nltk.lm.api.html) in `nltk`, create a bigram model based on the following dataset of sentences:

- To be or not to be. that is the question!
- Ask not what your country can do for you. Ask what you can do for your country.
- Is this the real life? is this just fantasy?

1. Show the bigram you have constructed (i.e., the dictionary).
2. Generate 10 new sentences. What do you notice about these sentences? Explain what's interesting about your observation(s).
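If the mechanics of a bigram model feel abstract, the following plain-Python sketch may help. It is not the `nltk` API: the toy corpus, the helper functions, and the whitespace tokenization are all assumptions made for illustration. The model is just a table of counts, probabilities are relative frequencies (maximum likelihood), and generation repeatedly samples the next word given the previous one.

```python
import random
from collections import Counter, defaultdict

toy_corpus = ["the cat sat", "the dog ran", "a cat ran"]  # hypothetical sentences

def toy_train_bigram(sentences):
    """Count how often each word follows each other word, with <s> and </s> padding."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def toy_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for an unseen context."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def toy_generate(counts, rng, max_words=15):
    """Sample words one at a time until </s> is drawn or max_words is reached."""
    word, words = "<s>", []
    for _ in range(max_words):
        followers = counts[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":
            break
        words.append(word)
    return " ".join(words)

toy_counts = toy_train_bigram(toy_corpus)
print(dict(toy_counts["the"]))                      # counts of words seen after "the"
print(toy_probability(toy_counts, "the", "cat"))    # relative frequency of "cat" after "the"
print(toy_generate(toy_counts, random.Random(0)))   # one sampled sentence
```

Note that the sampler can stitch together word pairs from different training sentences, so some outputs may never have appeared in the corpus.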

In [None]:
sentences = [
    "To be or not to be. that is the question!",
    "Ask not what your country can do for you. Ask what you can do for your country.",
    "Is this the real life? is this just fantasy?"
]

In [None]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# tokenize the sentences

# prepare the data using padded_everygram_pipeline

# train the model

In [None]:
from collections import Counter

ngrams_freq = Counter()

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

for ngram in train_data:
    ngrams_freq.update(ngram)

# Display n-grams and their frequencies
print("N-grams and their frequencies:")
for ngram, freq in ngrams_freq.items():
    print(f"{ngram}: {freq}")

In [None]:
def generate_sentence(model, num_words=15, start="to"):
    content = []
    word = start

    for _ in range(num_words):
        # update the prefix

        # Generate the next word based on the last n-1 words

        # if "end of sentence" is reached, exit
        if word == '</s>':
            break

    # Convert the list of words into a string.
    return


In [None]:
num_sentences = 10
start_word = "Ask"

for i in range(num_sentences):
    print(f"Sentence {i + 1}:")
    print(generate_sentence(model, 15, start=start_word))
    print("------")



**Answers**


Now, you will read the file `https://raw.githubusercontent.com/elhamod/IS883/main/Assignments/Week1/IS883_Week1_bustlingcity.txt`. You will create multiple __n-gram__ models, where n ∈ {2, 3, 4, 5, 10}. You will then, using each model, generate a text of similar length to the original file.

3. Compare the different generated texts. What observations do you make? Explain your observations with examples. __(0.5 points)__
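One way to build intuition for how n changes the generated text is to count how many different words can follow each context. The sketch below uses a made-up token sequence and a hypothetical helper, `toy_average_branching`; it is an illustration under those assumptions, not an analysis of the assignment file.

```python
from collections import defaultdict

def toy_average_branching(tokens, n):
    """Average number of distinct next words per (n-1)-word context."""
    followers = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        followers[context].add(tokens[i + n - 1])
    return sum(len(nexts) for nexts in followers.values()) / len(followers)

toy_text = "the cat sat on the mat the cat ate the rat".split()

for toy_n in [2, 3, 4]:
    print(toy_n, toy_average_branching(toy_text, toy_n))
```

When the average reaches 1.0, every context has exactly one possible continuation, so a model with that n can only replay the training text.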

In [None]:
import requests

## Read the text
url = "https://raw.githubusercontent.com/elhamod/IS883/main/Assignments/Week1/IS883_Week1_bustlingcity.txt"
response = requests.get(url)
text_content = response.text

# print the original text
print("original text:")
print(text_content)
print("--------")



In [None]:
for n in [2, 3, 4, 5, 10]:
  print("n =", n)

  # tokenize the sentences


  # prepare the data using padded_everygram_pipeline

  # train the model

  # print
  start_word = "<s>" # start a new sentence


  print("----------")

**Answer**



# Part 3: Homework

1. [Calculate the perplexity](https://www.nltk.org/api/nltk.lm.api.html) of the following sentences for the original `n in [2,3,4,5,10]` models in 2.5.1.3 (i.e., not including the reverse models). __(10 points)__

2. Comment on the results and elaborate on your findings.  __(10 points)__

 > In the heart of the bustling city,

 > There is a park. The park is beautiful.
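As a reminder of what perplexity measures, the standard definition for a sequence of N predicted tokens with model probabilities p_1, ..., p_N is 2 raised to the negative average of log2(p_i). The sketch below implements that definition directly in plain Python; `toy_perplexity` and the probability lists are illustrative assumptions, and the linked `nltk` documentation describes how the library exposes the same quantity.

```python
import math

def toy_perplexity(probabilities):
    """Perplexity of a sequence given the model's probability for each token."""
    if any(p == 0 for p in probabilities):
        # A single impossible token makes the whole sequence infinitely surprising.
        return math.inf
    avg_log2 = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-avg_log2)

print(toy_perplexity([0.5, 0.5]))   # model is as unsure as a fair coin flip per token
print(toy_perplexity([1.0, 1.0]))   # model is certain of every token
print(toy_perplexity([0.5, 0.0]))   # one token the model considers impossible
```

The zero-probability case matters for unsmoothed models: any n-gram that never appeared in the training text gets probability 0.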

In [None]:
def calculate_perplexity(text, model, n):
  return ####

In [None]:
for n in [2, 3, 4, 5, 10]:

  # tokenize the sentences


  # prepare the data using padded_everygram_pipeline


  # Create and train the model


  # Print

  print("*****")

**Answer**



## 3.2 AI Legal Assistant



To measure how well machine learning could be used for legal assistance, the bar association has hired you to curate a dataset from a large corpus of legal documents for training and testing different machine learning models. Once the dataset is curated [(e.g. this)](https://www.kaggle.com/datasets/anudit/india-legal-cases-dataset), many researchers and practitioners will bid on and use the published dataset to demonstrate the superiority of their models.

1. Can you think of a potential issue with such a practice in terms of model quality? __(5 points)__
2. Can you suggest remedies that are easy to implement for such issue(s)? __(5 points)__

**Answers**
