# II- Deuxième partie

we modify the prepare_data method to include a text cleaning step: 
#### Regular Expression:
    We use the re.sub function from the re module to replace all non-alphabetic characters in the text with an empty string, effectively removing them.

#### Cleaning Step:
    The cleaning step is performed on the entire text before splitting it into sentences. This ensures that non-alphabetic characters are removed from the entire corpus.

#### Updated Data Preparation:
    After cleaning, the rest of the prepare_data method remains the same, processing the cleaned text to generate the preprocessed corpus for training.

In [6]:
import random
import numpy as np
import math
from collections import defaultdict
import re


In [13]:
import re
import math
from collections import defaultdict, Counter
import numpy as np

class NgramLanguageModel:
    def __init__(self):
        self.trigram_counts = defaultdict(int)
        self.bigram_counts = defaultdict(int)
        self.unigram_counts = defaultdict(int)
        self.vocab = set()
        self.k = 0.01  # Smoothing parameter
        self.start_token = '<s>'
        self.end_token = '</s>'

    def prepare_data(self, data, ngram_size=2, is_file=True):
        if is_file:
            with open(data, 'r') as f:
                text = f.read().lower()
        else:
            text = data.lower()
 
        # Clean the text to remove non-alphabetic characters
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        sentences = text.split('\n')
        preprocessed_sentences = []

        for sentence in sentences:
            tokens = sentence.split()
            if ngram_size == 2:
                tokens = [self.start_token] + tokens + [self.end_token]
            elif ngram_size == 3:
                tokens = [self.start_token, self.start_token] + tokens + [self.end_token]
            preprocessed_sentences.append(' '.join(tokens))

        preprocessed_corpus = ' '.join(preprocessed_sentences)

        words = preprocessed_corpus.split()
        word_counts = Counter(words)
        self.vocab = {word for word in word_counts if word_counts[word] >= 1}
        self.vocab.add('<UNK>')

        def replace_oov(word):
            return word if word in self.vocab else '<UNK>'

        preprocessed_corpus = ' '.join(replace_oov(word) for word in words)

        return preprocessed_corpus

    def train(self, ngram_size=2, infile='shakespeare.txt'):
        preprocessed_corpus = self.prepare_data(infile, ngram_size, is_file=True)
        tokens = preprocessed_corpus.split()

        if ngram_size == 2:
            for i in range(len(tokens) - 1):
                self.bigram_counts[(tokens[i], tokens[i+1])] += 1
        elif ngram_size == 3:
            for i in range(len(tokens) - 2):
                self.trigram_counts[(tokens[i], tokens[i+1], tokens[i+2])] += 1

    def predict_ngram(self, sentence, ngram_size=2):
        preprocessed_sentence = self.prepare_data(sentence, ngram_size, is_file=False)
        tokens = preprocessed_sentence.split()
        log_prob = 0.0

        if ngram_size == 2:
            for i in range(len(tokens) - 1):
                log_prob += self.calculate_log_prob_bigram(tokens[i], tokens[i+1])
        elif ngram_size == 3:
            for i in range(len(tokens) - 2):
                log_prob += self.calculate_log_prob_trigram(tokens[i], tokens[i+1], tokens[i+2])

        return log_prob

    def calculate_log_prob_bigram(self, word1, word2):
        count_bigram = self.bigram_counts[(word1, word2)]
        count_unigram = sum(self.bigram_counts[(word1, w)] for w in self.vocab)
        vocab_size = len(self.vocab)
        prob = (count_bigram + self.k) / (count_unigram + self.k * vocab_size)
        return math.log(prob)

    def calculate_log_prob_trigram(self, word1, word2, word3):
        count_trigram = self.trigram_counts[(word1, word2, word3)]
        count_bigram = sum(self.trigram_counts[(word1, word2, w)] for w in self.vocab)
        vocab_size = len(self.vocab)
        prob = (count_trigram + self.k) / (count_bigram + self.k * vocab_size)
        return math.log(prob)

    def test_perplexity(self, test_file, ngram_size=2):
        total_log_prob = 0.0
        total_tokens = 0

        with open(test_file, 'r') as f:
            for line in f:
                sentence = line.strip().lower()
                total_log_prob += self.predict_ngram(sentence, ngram_size)
                total_tokens += len(sentence.split()) + 1  # Adding 1 for the end token

        avg_log_prob = total_log_prob / total_tokens
        perplexity = math.exp(-avg_log_prob)
        return perplexity

    def generate_text(self, ngram_size=2, max_length=200):
        if ngram_size == 2:
            current_token = self.start_token
            text = []
            sentence_count = 0
            for _ in range(max_length):
                next_token = self.sample_next_word(current_token, ngram_size)
                if next_token == self.end_token:
                    sentence_count += 1
                    if sentence_count >= 8:
                        break
                    current_token = self.start_token
                else:
                    text.append(next_token)
                    current_token = next_token
            return ' '.join(text)
        elif ngram_size == 3:
            current_tokens = [self.start_token, self.start_token]
            text = []
            sentence_count = 0
            for _ in range(max_length):
                next_token = self.sample_next_word(tuple(current_tokens), ngram_size)
                if next_token == self.end_token:
                    sentence_count += 1
                    if sentence_count >= 8:
                        break
                    current_tokens = [self.start_token, self.start_token]
                else:
                    text.append(next_token)
                    current_tokens = [current_tokens[1], next_token]
            return ' '.join(text)

    def sample_next_word(self, current_token, ngram_size):
        if ngram_size == 2:
            candidates = [(key[1], self.bigram_counts[key]) for key in self.bigram_counts if key[0] == current_token]
        elif ngram_size == 3:
            candidates = [(key[2], self.trigram_counts[key]) for key in self.trigram_counts if (key[0], key[1]) == current_token]

        if not candidates:
            return self.end_token

        words, counts = zip(*candidates)
        total = sum(counts)
        probs = [count / total for count in counts]

        return np.random.choice(words, p=probs)

    def auto_complete(self, prefix, ngram_size=2):
        tokens = prefix.lower().split()
        if ngram_size == 2:
            last_token = tokens[-1]
            next_word = self.sample_next_word(last_token, ngram_size)
        elif ngram_size == 3:
            last_two_tokens = tokens[-2:]
            next_word = self.sample_next_word(tuple(last_two_tokens), ngram_size)
        return next_word

    def correction(self, word, ngram_size=2):
        # This method requires additional implementation for spelling correction
        pass





#### Initialisation du modèle:

crée une instance de NgramLanguageModel.


In [14]:
# Exemple d'utilisation :
model = NgramLanguageModel()

### Entraînement du modèle:bigramme

 entraîne le modèle bigramme avec le fichier shakespeare.txt.

In [15]:
model.train(ngram_size=2, infile='big_data.txt')

In [16]:
print("Perplexity (Bigram) on test file:", model.test_perplexity(test_file='big_data.txt', ngram_size=2))


Perplexity (Bigram) on test file: 8.273168431638918


#### Génération de texte:

génère un texte basé sur le modèle bigramme.

In [17]:
print("Generated text (Bigram):", model.generate_text(ngram_size=2, max_length=200))


Generated text (Bigram): love your favorite kurt vile and technical and bricks at second halfksu mu what they not and tuesdays does not the sidewalk smash isnt known defeat of happiness never shout out the stock bump on the super early tmrwww youknowlifeisgreatwhen years since they could i like the rt i will pass out the tweets love youu but we will also looks like chandler man what call tonightthanks and jerome bronson doing down in atlanta will outperform nat library want a thug by great time


In [18]:
print("Generated text (Bigram):", model.generate_text(ngram_size=2, max_length=200))

Generated text (Bigram): i have a nugget nectar by earnings or sun hahahah dude im moving forward to classy very interestinq wow mobile web are moving truck will be for ne hes all craft beer seems as great meeting after one hell and nuts it digging the concert thanks for c gallagher will be great job min left evr was a really fun and then not the tivo in the tweets i miss free ebook exhibitors out to s do but youre obviously very well that there ever think helping peers out on the grammar as seconds of dust


In [19]:
print("Generated text (Bigram):", model.generate_text(ngram_size=2))

Generated text (Bigram): im stuck in the softball game eh walking youre driving more than vic for them saw that needs to control when its so i should be pushed us about it a good one of a super day yea did reason these prom cool and prayers my cincinnati its the worst well he was great to make it was vanilla pudding milkshake steve perry and the crosby sweepstakes today i wish the cursive it makes no possibility of project in mn for nola off the ass


#### Auto-complétion:
 prédit le mot suivant le plus probable après "I am" en utilisant le modèle bigramme.


In [20]:
print("Autocomplete for 'I am':", model.auto_complete('I am', ngram_size=2))

Autocomplete for 'I am': dizzy


In [21]:
print("Autocomplete for 'Please':", model.auto_complete('Please', ngram_size=2))

Autocomplete for 'Please': call


In [22]:
print("Autocomplete for 'I ':", model.auto_complete('I', ngram_size=2))

Autocomplete for 'I ': havent


In [23]:
print("Autocomplete for 'we were':", model.auto_complete('we were', ngram_size=2))

Autocomplete for 'we were': amazing


In [26]:
print("Autocomplete for 'do ':", model.auto_complete('do', ngram_size=2))

Autocomplete for 'do ': any


### Entraînement du modèle:¶ Trigramme 
entraîne le modèle Trigramme avec le fichier shakespeare.txt.

In [27]:
model.train(ngram_size=3, infile='big_data.txt')

In [28]:
print("Perplexity (Trigram) on test file:", model.test_perplexity(test_file='big_data.txt', ngram_size=3))

Perplexity (Trigram) on test file: 2.0651289642857544


#### Génération de texte:

 génère un texte basé sur le modèle Trigramme.

In [29]:
print("Generated text (Trigram):", model.generate_text(ngram_size=3))

Generated text (Trigram): i do is win congrats jim good luck finding this one a tiny attic or crawl space if i slam this manila folder into the playoffs so the beast can show you what the fuck brings grenades to a soccer goal obviously i lost it in israel and he alone is responsible for what if you want to marry him some love and i will be a reviewer let us know if you make my way there are certain colors that dont walk like see your rendition of swipes on swipes lol im glad we have is better than the seeing of dolls murder at the love of all you want to try peeps next time you have to be around much today kiss your mommy for me great picture the green shirt with ur shortcmings they will learn anyway can u chew gum walk a mile in his songs he takes lots of fun see you tonight


In [31]:
print("Generated text (Trigram):", model.generate_text(ngram_size=3))

Generated text (Trigram): seal team rocks congrats on a shoot out salute follow him for a reason not to mention a masterpiece of reptilian westerns i am on twitter in was crazy sheesh just think about that one thing iss on another week then the next one lakers stop making excuses for kobe since they are eating may be the one night and invite people over hmm gotta pick a subject area or just a few new songs i wanna go tanning again uh gmorning good to their purpose thats what its all mine muahahha yes i did with it


#### Auto-complétion:

prédit le mot suivant le plus probable en utilisant le modèle Trigramme.

In [30]:
print("Autocomplete for 'I am':", model.auto_complete('I am', ngram_size=3))

Autocomplete for 'I am': crazy


In [40]:
print("Autocomplete for 'do you':", model.auto_complete('do you', ngram_size=3))

Autocomplete for 'please': still


In [42]:
print("Autocomplete for 'do you':", model.auto_complete('do you', ngram_size=3))

Autocomplete for 'do you': ever
