# Custom Functions

This notebook documents the custom functions we defined while building our models.

Here are the steps to import those functions:

```python
!pip install requests
```

```python
import requests
```

```python
url = 'https://raw.githubusercontent.com/danieldovale/DMML2022_Tissot/main/code/custom_functions.py'
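r = requests.get(url)
r.raise_for_status()  # stop here if the download failed

# Save the module locally so that it can be imported
with open('custom_functions.py', 'w') as f:
    f.write(r.text)
print(r.text)  # optional: display the downloaded source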

import custom_functions as cfun
```

To call these functions, use the `cfun.` prefix. For example:

```python 
cfun.evaluate(y_test, y_pred)
```


---

## 1. evaluate()



```python
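import pandas as pd
import seaborn as sns
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def evaluate(true, pred):
    # Precision, recall and F1 are averaged over classes, weighted by class support
    precision = precision_score(true, pred, average='weighted')
    recall = recall_score(true, pred, average='weighted')
    f1 = f1_score(true, pred, average='weighted')
    acc = accuracy_score(true, pred)
    d = {'accuracy': round(acc, 4), 'precision': round(precision, 4),
         'recall': round(recall, 4), 'f1 score': round(f1, 4)}
    df = pd.DataFrame(d, index=["results"])
    # Plot the confusion matrix (rows: true labels, columns: predicted labels)
    sns.heatmap(pd.DataFrame(confusion_matrix(true, pred)), annot=True, cmap='Oranges', fmt='.7g')
    return df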
```





The `evaluate` function takes two arguments, `true` and `pred`. Both are lists or arrays of the same length, holding the true labels and the predicted labels of a classification task.

The function computes the following evaluation metrics. Precision, recall, and F1 are computed per class and then averaged, with each class weighted by its number of true instances (`average='weighted'`):

* `precision`: the ratio of true positive predictions to all positive predictions
* `recall`: the ratio of true positive predictions to all actual positive instances
* `F1 score`: the harmonic mean of precision and recall
* `accuracy`: the ratio of correct predictions to the total number of predictions

The function also computes a confusion matrix with the `confusion_matrix` function, which counts, for each true class, how many samples were assigned to each predicted class. It then plots this matrix as a heatmap with the `sns.heatmap` function.

The evaluation metrics are then stored in a dictionary and used to create a Pandas dataframe, which is returned by the function.
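
To make the `average='weighted'` setting concrete, here is a minimal, dependency-free sketch of how weighted precision, recall, and F1 can be computed by hand. It is an illustration only: `evaluate` itself relies on scikit-learn, and the helper name `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_scores(true, pred):
    """Per-class precision/recall/F1, averaged with class support as weights."""
    labels = sorted(set(true) | set(pred))
    support = Counter(true)          # number of true instances per class
    n = len(true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
        predicted = sum(1 for p in pred if p == label)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / n          # classes with more true samples count more
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(true, pred) if t == p) / n
    return {"accuracy": round(accuracy, 4), "precision": round(precision, 4),
            "recall": round(recall, 4), "f1 score": round(f1, 4)}

# Hypothetical labels, chosen only for illustration
y_true = ["A1", "A1", "A2", "B1", "B1", "B1"]
y_pred = ["A1", "A2", "A2", "B1", "B1", "A1"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

One consequence of support weighting is that weighted recall always equals accuracy, so the two columns of the returned dataframe carry the same value.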





---

## 2. prediction()



```python
import pandas as pd

def prediction(data, name, download=False):
    df = pd.DataFrame(data=data)
    df.index.names = ['id']
    df.rename(columns={0: 'difficulty'}, inplace=True)
    file_name = name + ".csv"
    df.to_csv(file_name)
    if download:
        # files.download is only available inside Google Colab
        from google.colab import files
        files.download(file_name)
    return df.head()
```



The `prediction` function takes three arguments:

* `data`: a list or array of values (for example, predicted labels)
* `name`: a string used as the name of the resulting CSV file (without the `.csv` extension)
* `download`: a boolean indicating whether or not to download the CSV file (defaults to `False`)

The function first creates a Pandas dataframe from the input data. The dataframe gets a default integer index (0, 1, 2, ...) and a single column named `0` that holds the values. The function then names the index `id` and renames the `0` column to `difficulty`.

The function then saves the dataframe to a CSV file with the specified name. If the download argument is set to True, the function uses the files.download function from the Google Colab library to download the file. Finally, the function returns the first few rows of the dataframe.
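
As a minimal sketch of the file layout this produces, the snippet below repeats the same pandas steps on a toy list and reads the CSV back. The labels, the file name `toy_submission`, and the temporary directory are illustrative assumptions. The Colab download step is left out so that the snippet runs wherever pandas is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; in the notebook this would be a model's output.
toy_predictions = ["A1", "C2", "B1"]

df = pd.DataFrame(data=toy_predictions)     # one column named 0, index 0..n-1
df.index.names = ["id"]                     # label the default integer index
df = df.rename(columns={0: "difficulty"})

# Write to a temporary directory so nothing is left behind after the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_submission.csv")
    df.to_csv(path)
    with open(path) as f:
        csv_lines = f.read().splitlines()

print(csv_lines)
```

The first line of the file is the header `id,difficulty`, followed by one row per prediction.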

---

## 3. spacy_tokenizer() and get_info()



```python
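import string

import nltk

# `sp` is assumed to be a spaCy French pipeline loaded elsewhere (its definition is not shown here)
def spacy_tokenizer(sentence):
    doc = sp(sentence)
    stop_words = nltk.corpus.stopwords.words("french")
    punctuations = string.punctuation

    # Lemmatize and lowercase each token; "-PRON-" placeholders keep their lowercase text
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
    # Drop punctuation and French stop words
    mytokens = [word for word in mytokens if word not in punctuations and word not in stop_words]
    return mytokens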
```



The `spacy_tokenizer` function takes a single argument: a string containing a sentence. The `get_info` function defined below applies the same lemmatization step inline.

The function first processes the sentence with the spaCy library to create a `doc` object, which holds information about each token, such as its part of speech, its dependencies, and its lemma.

The function then defines `stop_words`, the NLTK list of French stop words, and `punctuations`, the string of ASCII punctuation characters from `string.punctuation`.

Next, the function builds a list of tokens from the `doc` object. For each token it takes the lowercased, stripped lemma, or the lowercased text when the lemma is the pronoun placeholder `-PRON-`. Finally, it removes every token that appears in `punctuations` or `stop_words` and returns the remaining tokens.
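
The filtering logic can be tried without loading a spaCy model. The sketch below replaces the `doc` object with a hand-written list of (text, lemma) pairs and uses a tiny stop-word list. Both are hypothetical stand-ins, as is the helper name `filter_tokens`, so the output illustrates the two list comprehensions rather than spaCy's actual lemmatization.

```python
import string

# Hypothetical (text, lemma) pairs standing in for spaCy tokens.
fake_doc = [("Les", "le"), ("chats", "chat"), ("mangent", "manger"),
            ("des", "un"), ("souris", "souris"), (".", ".")]
# Tiny stand-in for the NLTK French stop-word list.
toy_stop_words = ["le", "la", "les", "un", "une", "des"]

def filter_tokens(doc, stop_words):
    punctuations = string.punctuation
    # Step 1: lowercase lemma, or lowercase text for the "-PRON-" placeholder.
    tokens = [lemma.lower().strip() if lemma != "-PRON-" else text.lower()
              for text, lemma in doc]
    # Step 2: drop punctuation and stop words. `punctuations` is a string, so
    # `in` is a substring test, which also drops empty tokens.
    return [tok for tok in tokens
            if tok not in punctuations and tok not in stop_words]

tokens = filter_tokens(fake_doc, toy_stop_words)
print(tokens)
```

With these toy inputs only the three content words remain: the articles are removed as stop words and the final period is removed as punctuation.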



```python
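# Imports needed by get_info
import string

import numpy as np
import pandas as pd
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

# `sp` is assumed to be a spaCy French pipeline loaded elsewhere
def get_info(df):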
 
    text_length = []                          
    number_of_sentences = []
    number_of_words = []
    sent_length_avg = []
    words_length_avg = []
    number_of_words_after_lemma_stop = []
    longest_word_size = []
    
    for text in tqdm(df['sentence'].values):
        
      initial_length = len(text)
      text_length.append(initial_length)

      num_sentences = len(sent_tokenize(text))
      number_of_sentences.append(num_sentences)
        
      punctuations = string.punctuation
      text2 = text.lower()
      text2 = word_tokenize(text2)
      text2 = [word for word in text2 if word not in punctuations]
      num_words = len(text2)
      number_of_words.append(num_words)

      sent_length_avg.append(num_words/num_sentences)
        
      words_length_avg.append(initial_length/num_words)

      # Lemmatize with spaCy, then drop spaCy's French stop words and punctuation
      text = sp(text)
      text = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in text]
      text = [word for word in text if word not in spacy.lang.fr.stop_words.STOP_WORDS and word not in punctuations]

      num_words_after_lemma_stop = len(text)
      number_of_words_after_lemma_stop.append(num_words_after_lemma_stop)

      word_len = [len(w) for w in text2]
      longest_word_size.append(np.max(word_len))
        
    final_df = pd.concat([pd.Series(text_length), pd.Series(number_of_sentences),
                             pd.Series(number_of_words), pd.Series(sent_length_avg),
                             pd.Series(words_length_avg), pd.Series(number_of_words_after_lemma_stop),
                             pd.Series(longest_word_size)], axis = 1)
    final_df.columns = ["text_length", "number_of_sentences", "number_of_words",
                           "sent_length_avg", "words_length_avg",
                           "number_of_words_after_lemma_stop", "longest_word_size"]
    
    return final_df
```



The `get_info` function takes in a single argument: a Pandas dataframe `df` with a column called 'sentence'.

The function initializes several empty lists to store various statistics about the sentences in the dataframe. It then iterates through the 'sentence' column of the dataframe using a progress bar provided by the `tqdm` function.

For each sentence in the dataframe, the function performs the following operations:

1. It computes the length of the text in characters and appends it to the `text_length` list.
2. It uses the `sent_tokenize` function from the `nltk` library to split the text into individual sentences, and then counts them. This value is appended to the `number_of_sentences` list. The function is called without a `language` argument, so NLTK's default (English) is used.
3. It lowercases the text, tokenizes it into words using the `word_tokenize` function from the `nltk` library, removes punctuation tokens, and counts the remaining words. This value is appended to the `number_of_words` list.
4. It calculates the average number of words per sentence by dividing the number of words by the number of sentences and appends the result to the `sent_length_avg` list.
5. It estimates the average word length by dividing the character count of the whole text (spaces and punctuation included) by the number of words and appends the result to the `words_length_avg` list.
6. It processes the text with the `spacy` library, lemmatizes the tokens, and removes spaCy's French stop words and punctuation. It then counts the remaining tokens and appends the result to the `number_of_words_after_lemma_stop` list.
7. It calculates the length of each word token from step 3, finds the maximum, and appends it to the `longest_word_size` list.

After iterating through all the sentences in the dataframe, the function creates a new dataframe from the statistics lists and assigns appropriate column names. Finally, it returns the resulting dataframe.
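
For intuition about the numeric columns, the sketch below computes the same ratios for a single text in plain Python. The regular-expression splitters are simplified stand-ins for NLTK's `sent_tokenize` and `word_tokenize` and will not match NLTK on every input. The lemmatization step is omitted, and the helper name `basic_stats` and the example text are hypothetical.

```python
import re

def basic_stats(text):
    # Stand-in sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Stand-in word tokenizer: runs of word characters, lowercased.
    words = re.findall(r"\w+", text.lower())
    return {
        "text_length": len(text),
        "number_of_sentences": len(sentences),
        "number_of_words": len(words),
        "sent_length_avg": len(words) / len(sentences),
        # As in get_info: characters of the whole text divided by word count.
        "words_length_avg": len(text) / len(words),
        "longest_word_size": max(len(w) for w in words),
    }

stats = basic_stats("Le chat dort. Il fait beau dehors !")
print(stats)
```

Like `get_info`, this sketch assumes that every text contains at least one word; an empty string would raise a `ZeroDivisionError` in the averages.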