<a href="https://colab.research.google.com/github/canhbd/StickyTableHeaders/blob/master/fluency_acc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Portfolio - Final Project

**Instructions for Students:**

Please carefully follow these steps to complete and submit your assignment:

1. **Completing the Assignment**: You are required to work on and complete all tasks in the provided assignment. Be disciplined and ensure that you thoroughly engage with each task.
   
2. **Creating a Google Drive Folder**: If you don't already have a folder for collecting assignments, create a new one in your Google Drive. It will hold all your completed assignment files, keeping your work organized and easy to access.
   
3. **Uploading Completed Assignment**: Upon completion of your assignment, make sure to upload all necessary files, including code, reports, and related documents, into the Google Drive folder you created. Save the folder link in the 'Student Identity' section and also pass it as the last parameter of the provided `submit` function.
   
4. **Sharing Folder Link**: You're required to share the link to your assignment Google Drive folder. This is crucial for the submission and evaluation of your assignment.
   
5. **Setting Permission to Public**: Please make sure your **Google Drive folder is set to public**. This allows your instructor to access your solutions and assess your work correctly.

Adhering to these procedures will facilitate a smooth assignment process for you and the reviewers.

**Description:**

Welcome to your final portfolio project assignment for AI Bootcamp. This is your chance to put all the skills and knowledge you've learned throughout the bootcamp into action by creating a real-world AI application.

You have the freedom to create any application or model, whether text-based, image-based, voice-based, or multimodal.

To get you started, here are some ideas:

1. **Sentiment Analysis Application:** Develop an application that can determine sentiment (positive, negative, neutral) from text data like reviews or social media posts. You can use Natural Language Processing (NLP) libraries like NLTK or TextBlob, or more advanced pre-trained models from the transformers library by Hugging Face, for your sentiment analysis model.

2. **Chatbot:** Design a chatbot serving a specific purpose such as customer service for a certain industry, a personal fitness coach, or a study helper. Libraries like ChatterBot or Dialogflow can assist in designing conversational agents.

3. **Predictive Text Application:** Develop a model that suggests the next word or sentence similar to predictive text on smartphone keyboards. You could use the transformers library by Hugging Face, which includes pre-trained models like GPT-2.

4. **Image Classification Application:** Create a model to distinguish between different types of flowers or fruits. For this type of image classification task, pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be utilized.

5. **News Article Classifier:** Develop a text classification model that categorizes news articles into predefined categories. NLTK, SpaCy, and sklearn are valuable libraries for text pre-processing, feature extraction, and building classification models.

6. **Recommendation System:** Create a simplified recommendation system. For instance, a book or movie recommender based on user preferences. Python's Surprise library can assist in building effective recommendation systems.

7. **Plant Disease Detection:** Develop a model to identify diseases in plants using leaf images. This project requires a good understanding of convolutional neural networks (CNNs) and image processing. PyTorch, TensorFlow, and OpenCV are all great tools to use.

8. **Facial Expression Recognition:** Develop a model to classify human facial expressions. This involves complex feature extraction and classification algorithms. You might want to leverage deep learning libraries like TensorFlow or PyTorch, along with OpenCV for processing facial images.

9. **Chest X-Ray Interpretation:** Develop a model to detect abnormalities in chest X-ray images. This task may require understanding of specific features in such images. Again, TensorFlow and PyTorch for deep learning, and libraries like SciKit-Image or PIL for image processing, could be of use.

10. **Food Classification:** Develop a model to classify a variety of foods such as local Indonesian food. Pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be a good starting point.

11. **Traffic Sign Recognition:** Design a model to recognize different traffic signs. This project has real-world applicability in self-driving car technology. Once more, you might utilize PyTorch or TensorFlow for the deep learning aspect, and OpenCV for image processing tasks.

**Submission:**

Please upload both your model and application to Hugging Face or your own GitHub account for submission.

**Presentation:**

You are required to create a presentation to showcase your project, including the following details:

- The objective of your model.
- A comprehensive description of your model.
- The specific metrics used to measure your model's effectiveness.
- A brief overview of the dataset used, including its source, pre-processing steps, and any insights.
- An explanation of the methodology used in developing the model.
- A discussion of the challenges you faced, how you handled them, and what you learned from them.
- Suggestions for potential future improvements to the model.
- A functioning link to a demo of your model in action.

**Grading:**

Submissions will be manually graded, with a select few given the opportunity to present their projects in front of a panel of judges. This will provide valuable feedback, further enhancing your project and expanding your knowledge base.

Remember, consistent practice is the key to mastering these concepts. Apply your knowledge, ask questions when in doubt, and above all, enjoy the process. Best of luck to you all!


In [None]:
# @title #### Student Identity
student_id = "REAV410F" # @param {type:"string"}
name = "Johnson Rouslie Junior" # @param {type:"string"}
drive_link = "https://drive.google.com/drive/u/0/folders/1ZFMFGgkLRSfkZsAFY2-oGJccQejpHD8M"  # @param {type:"string"}
assignment_id = "00_portfolio_project"

## Installation and Import `rggrader` Package

In [None]:
%pip install rggrader
from rggrader import submit_image
from rggrader import submit

Collecting rggrader
  Downloading rggrader-0.1.6-py3-none-any.whl (2.5 kB)
Installing collected packages: rggrader
Successfully installed rggrader-0.1.6


## Working Space

In [None]:
!pip install datasets transformers accelerate evaluate gradio



### Login

In [None]:
from huggingface_hub import notebook_login
notebook_login()


### Load Dataset

In [None]:
from datasets import load_dataset, Audio
pronunciation = load_dataset("mispeech/speechocean762", split="train+test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/2500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2500 [00:00<?, ? examples/s]

The aim of this project is to predict accuracy, fluency, and prosodic scores from audio of spoken English sentences, so we remove the columns we don't need from the dataset.

In [None]:
pronunciation = pronunciation.remove_columns(["completeness", "text", "total", "words", "speaker", "gender", "age"])
pronunciation

Dataset({
    features: ['accuracy', 'fluency', 'prosodic', 'audio'],
    num_rows: 5000
})

According to the dataset's creators, the scores have the following meanings:
1. Accuracy:
Score range: 0 - 10
  - 9-10: The overall pronunciation of the sentence is excellent, with accurate phonology and no obvious pronunciation mistakes
  - 7-8: The overall pronunciation of the sentence is good, with a few pronunciation mistakes
  - 5-6: The overall pronunciation of the sentence is understandable, with many pronunciation mistakes and accent, but it does not affect the understanding of basic meanings
  - 3-4: Poor, clumsy and rigid pronunciation of the sentence as a whole, with serious pronunciation mistakes
  - 0-2: Extremely poor pronunciation and only one or two words are recognizable

2. Fluency:
Score range: 0 - 10
  - 8-10: Fluent without noticeable pauses or stammering
  - 6-7: Fluent in general, with a few pauses, repetition, and stammering
  - 4-5: The speech is a little influent, with many pauses, repetition, and stammering
  - 0-3: Intermittent, very influent speech, with lots of pauses, repetition, and stammering

3. Prosodic:
Score range: 0 - 10
  - 9-10: Correct intonation at a stable speaking speed, speak with cadence, and can speak like a native
  - 7-8: Nearly correct intonation at a stable speaking speed, nearly smooth and coherent, but with little stammering and few pauses
  - 5-6: Unstable speech speed, many stammering and pauses with a poor sense of rhythm
  - 3-4: Unstable speech speed, speak too fast or too slow, without the sense of rhythm
  - 0-2: Poor intonation and lots of stammering and pauses, unable to read a complete sentence

Let's look at the unique values of each score.

In [None]:
import numpy as np
print(np.unique(pronunciation["accuracy"]))
print(len(np.unique(pronunciation["accuracy"])))

[ 1  3  4  5  6  7  8  9 10]
9


In [None]:
print(np.unique(pronunciation["fluency"]))
print(len(np.unique(pronunciation["fluency"])))

[ 0  1  2  3  4  5  6  7  8  9 10]
11


In [None]:
print(np.unique(pronunciation["prosodic"]))
print(len(np.unique(pronunciation["prosodic"])))

[ 0  1  2  3  4  5  6  7  8  9 10]
11


The accuracy column is missing some scores (0 and 2 never occur). To deal with this, we group the scores into the bands used in the rubric above: for example, accuracy scores 0-2 become class 0 with the label 'Extremely Poor', scores 3-4 become class 1, and so on. We group the fluency and prosodic scores in the same way.

In [None]:
def change_accuracy(example):
  if 0 <= example['accuracy'] <= 2:
    example['accuracy'] = 0
  elif 3 <= example['accuracy'] <= 4:
    example['accuracy'] = 1
  elif 5 <= example['accuracy'] <= 6:
    example['accuracy'] = 2
  elif 7 <= example['accuracy'] <= 8:
    example['accuracy'] = 3
  else:
    example['accuracy'] = 4
  return example

def change_fluency(example):
  if 0 <= example['fluency'] <= 3:
    example['fluency'] = 0
  elif 4 <= example['fluency'] <= 5:
    example['fluency'] = 1
  elif 6 <= example['fluency'] <= 7:
    example['fluency'] = 2
  else:
    example['fluency'] = 3
  return example

def change_prosodic(example):
  if 0 <= example['prosodic'] <= 2:
    example['prosodic'] = 0
  elif 3 <= example['prosodic'] <= 4:
    example['prosodic'] = 1
  elif 5 <= example['prosodic'] <= 6:
    example['prosodic'] = 2
  elif 7 <= example['prosodic'] <= 8:
    example['prosodic'] = 3
  else:
    example['prosodic'] = 4
  return example
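
The three functions above all follow the same pattern, so the grouping can also be expressed as a lookup against the upper bound of each band. The following is only a sketch of that idea: the constant names and the `to_group` helper are hypothetical and not part of the original notebook, and the bounds are copied from the rubric.

```python
from bisect import bisect_left

# Inclusive upper bound of every band except the last one (taken from the rubric).
ACCURACY_UPPER_BOUNDS = [2, 4, 6, 8]  # bands: 0-2, 3-4, 5-6, 7-8, 9-10
FLUENCY_UPPER_BOUNDS = [3, 5, 7]      # bands: 0-3, 4-5, 6-7, 8-10
PROSODIC_UPPER_BOUNDS = [2, 4, 6, 8]  # bands: 0-2, 3-4, 5-6, 7-8, 9-10


def to_group(score, upper_bounds):
    """Return the index of the band that an integer score falls into."""
    # bisect_left counts how many upper bounds are strictly smaller than the score,
    # which is exactly the index of the band containing it.
    return bisect_left(upper_bounds, score)


for raw_score in (0, 2, 3, 8, 9, 10):
    print(raw_score, "->", to_group(raw_score, ACCURACY_UPPER_BOUNDS))
```

With a helper like this, each `change_*` function would reduce to a single assignment such as `example['accuracy'] = to_group(example['accuracy'], ACCURACY_UPPER_BOUNDS)`.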

In [None]:
pronunciation = pronunciation.map(change_accuracy)
pronunciation = pronunciation.map(change_fluency)
pronunciation = pronunciation.map(change_prosodic)

In [None]:
print(np.unique(pronunciation["accuracy"]))
print(len(np.unique(pronunciation["accuracy"])))

[0 1 2 3 4]
5


In [None]:
print(np.unique(pronunciation["fluency"]))
print(len(np.unique(pronunciation["fluency"])))

[0 1 2 3]
4


In [None]:
print(np.unique(pronunciation["prosodic"]))
print(len(np.unique(pronunciation["prosodic"])))

[0 1 2 3 4]
5


In [None]:
pronunciation[0]

{'accuracy': 3,
 'fluency': 3,
 'prosodic': 4,
 'audio': {'path': '000010011.wav',
  'array': array([-9.46044922e-04, -2.38037109e-03, -1.31225586e-03, ...,
         -9.15527344e-05,  3.05175781e-04, -2.44140625e-04]),
  'sampling_rate': 16000}}

Next, we split the dataset into 80% training data and 20% test data.

In [None]:
pronunciation = pronunciation.train_test_split(test_size=0.2)
pronunciation

DatasetDict({
    train: Dataset({
        features: ['accuracy', 'fluency', 'prosodic', 'audio'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['accuracy', 'fluency', 'prosodic', 'audio'],
        num_rows: 1000
    })
})
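
The split above does not fix a random seed, so the rows assigned to each split can change from run to run. `train_test_split` accepts a `seed` argument for this purpose. The sketch below illustrates the idea with the standard library only; the `split_indices` helper and the seed value 42 are assumptions made for illustration, not something the original notebook uses.

```python
import random


def split_indices(num_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(5000)
print(len(train_idx), len(test_idx))  # 4000 1000
```

In the notebook itself, the equivalent change would be to pass a seed, for example `pronunciation.train_test_split(test_size=0.2, seed=42)`.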

In [None]:
pronunciation['train'][0]

{'accuracy': 4,
 'fluency': 3,
 'prosodic': 4,
 'audio': {'path': '021290200.wav',
  'array': array([-0.00158691, -0.00219727, -0.00158691, ..., -0.00036621,
          0.0005188 , -0.00128174]),
  'sampling_rate': 16000}}

We manually assign a label to each score group.

In [None]:
accuracy_labels = ['Extremely Poor', 'Poor', 'Average', 'Good', 'Excellent']
fluency_labels = ['Very Influent', 'Influent', 'Average', 'Fluent']
prosodic_labels = ['Poor', 'Unstable', 'Stable', 'Almost', 'Perfect']

In [None]:
accuracy_label2id, accuracy_id2label = dict(), dict()
for i, label in enumerate(accuracy_labels):
    accuracy_label2id[label] = str(i)
    accuracy_id2label[str(i)] = label

In [None]:
accuracy_label2id

{'Extremely Poor': '0',
 'Poor': '1',
 'Average': '2',
 'Good': '3',
 'Excellent': '4'}

In [None]:
accuracy_id2label

{'0': 'Extremely Poor',
 '1': 'Poor',
 '2': 'Average',
 '3': 'Good',
 '4': 'Excellent'}

In [None]:
fluency_label2id, fluency_id2label = dict(), dict()
for i, label in enumerate(fluency_labels):
    fluency_label2id[label] = str(i)
    fluency_id2label[str(i)] = label

In [None]:
fluency_label2id

{'Very Influent': '0', 'Influent': '1', 'Average': '2', 'Fluent': '3'}

In [None]:
fluency_id2label

{'0': 'Very Influent', '1': 'Influent', '2': 'Average', '3': 'Fluent'}

In [None]:
prosodic_label2id, prosodic_id2label = dict(), dict()
for i, label in enumerate(prosodic_labels):
    prosodic_label2id[label] = str(i)
    prosodic_id2label[str(i)] = label

In [None]:
prosodic_label2id

{'Poor': '0', 'Unstable': '1', 'Stable': '2', 'Almost': '3', 'Perfect': '4'}

In [None]:
prosodic_id2label

{'0': 'Poor', '1': 'Unstable', '2': 'Stable', '3': 'Almost', '4': 'Perfect'}

In [None]:
accuracy_num_labels = 5
fluency_num_labels = 4
prosodic_num_labels = 5
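
The three mapping loops above are identical apart from the label list, so they could be folded into one helper. This is a minimal sketch that assumes the same string-valued ids as the cells above; `build_label_maps` is a hypothetical name, not part of the original notebook.

```python
def build_label_maps(labels):
    """Build label2id and id2label dictionaries with string ids, as in the cells above."""
    label2id = {label: str(i) for i, label in enumerate(labels)}
    id2label = {str(i): label for i, label in enumerate(labels)}
    return label2id, id2label


accuracy_labels = ['Extremely Poor', 'Poor', 'Average', 'Good', 'Excellent']
accuracy_label2id, accuracy_id2label = build_label_maps(accuracy_labels)

# Deriving the count from the list avoids hard-coding it separately.
accuracy_num_labels = len(accuracy_labels)

print(accuracy_label2id)
print(accuracy_id2label)
print(accuracy_num_labels)
```

The same call would work for `fluency_labels` and `prosodic_labels`.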

### Preprocess

In [None]:
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

In [None]:
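def preprocess_function(examples):
  # Collect the raw waveform of every example in the batch
  audio_arrays = [x["array"] for x in examples["audio"]]
  # Convert the waveforms to Whisper input features; the dataset audio is
  # already sampled at 16 kHz, the rate the feature extractor expects
  inputs = feature_extractor(
    audio_arrays,
    sampling_rate=feature_extractor.sampling_rate,
    truncation=True,
  )
  return inputs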

In [None]:
encoded_pronunciation = pronunciation.map(
    preprocess_function,
    remove_columns="audio",
    batched=True,
    batch_size=2,
    num_proc=1,
)
encoded_pronunciation

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['accuracy', 'fluency', 'prosodic', 'input_features'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['accuracy', 'fluency', 'prosodic', 'input_features'],
        num_rows: 1000
    })
})
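
It helps to know how large each `input_features` entry is. As far as we understand it, the Whisper feature extractor pads or truncates every clip to a fixed 30-second window and turns it into a log-mel spectrogram. The arithmetic below is a sketch under that assumption: the constants are the defaults we believe apply to `openai/whisper-base` and are not read from the notebook, so it is worth checking them against the attributes of `feature_extractor`.

```python
# Assumed Whisper feature-extractor defaults (verify against `feature_extractor`).
SAMPLING_RATE = 16_000  # samples per second
CHUNK_LENGTH_S = 30     # every clip is padded or truncated to this many seconds
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # mel bins per frame

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S  # samples in one padded clip
n_frames = n_samples // HOP_LENGTH          # spectrogram frames per clip
values_per_example = N_MELS * n_frames      # numbers stored per example

print(n_samples, n_frames, values_per_example)  # 480000 3000 240000
```

Under these assumptions every example has the same shape regardless of how short the original sentence was, which is why the encoded dataset takes noticeably more space than the raw audio.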

We split accuracy, fluency, and prosodic into separate datasets so that each model is trained on a single target.

In [None]:
encoded_pronunciation_accuracy = encoded_pronunciation.remove_columns(["fluency", "prosodic"])
encoded_pronunciation_fluency = encoded_pronunciation.remove_columns(["accuracy", "prosodic"])
encoded_pronunciation_prosodic = encoded_pronunciation.remove_columns(["accuracy", "fluency"])

In [None]:
encoded_pronunciation_accuracy

DatasetDict({
    train: Dataset({
        features: ['accuracy', 'input_features'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['accuracy', 'input_features'],
        num_rows: 1000
    })
})

In [None]:
encoded_pronunciation_fluency

DatasetDict({
    train: Dataset({
        features: ['fluency', 'input_features'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['fluency', 'input_features'],
        num_rows: 1000
    })
})

In [None]:
encoded_pronunciation_prosodic

DatasetDict({
    train: Dataset({
        features: ['prosodic', 'input_features'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['prosodic', 'input_features'],
        num_rows: 1000
    })
})

### Evaluate

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np

def compute_metrics(eval_pred):
  predictions = np.argmax(eval_pred.predictions, axis=1)
  return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
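
To make clear what `compute_metrics` measures, here is a dependency-free sketch of the same calculation: take the index of the largest logit in each row as the prediction and report the fraction that matches the reference labels. The helper names and the toy logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == ref for pred, ref in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Toy batch: four examples, three classes (values are illustrative only).
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 1, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

Note that plain accuracy treats every wrong class as equally wrong, so predicting class 3 for a true class 4 counts the same as predicting class 0.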

In [None]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

In [None]:
encoded_pronunciation_accuracy = encoded_pronunciation_accuracy.rename_column("accuracy", "label")
encoded_pronunciation_fluency = encoded_pronunciation_fluency.rename_column("fluency", "label")
encoded_pronunciation_prosodic = encoded_pronunciation_prosodic.rename_column("prosodic", "label")

In [None]:
encoded_accuracy_model = AutoModelForAudioClassification.from_pretrained(
  "openai/whisper-base", num_labels=accuracy_num_labels, label2id=accuracy_label2id, id2label=accuracy_id2label
)

encoded_fluency_model = AutoModelForAudioClassification.from_pretrained(
  "openai/whisper-base", num_labels=fluency_num_labels, label2id=fluency_label2id, id2label=fluency_id2label
)

encoded_prosodic_model = AutoModelForAudioClassification.from_pretrained(
  "openai/whisper-base", num_labels=prosodic_num_labels, label2id=prosodic_label2id, id2label=prosodic_id2label
)

Some weights of WhisperForAudioClassification were not initialized from the model checkpoint at openai/whisper-base and are newly initialized: ['model.classifier.bias', 'model.classifier.weight', 'model.projector.bias', 'model.projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="model/pronunciation_accuracy",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=encoded_accuracy_model,
    args=training_args,
    train_dataset=encoded_pronunciation_accuracy["train"],
    eval_dataset=encoded_pronunciation_accuracy["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | No log | 0.821638 | 0.639 |
| 2 | No log | 0.834073 | 0.638 |
| 3 | No log | 0.927742 | 0.623 |


Checkpoint destination directory model/pronunciation_accuracy/checkpoint-125 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358, 50359, 50360, 50361, 50362], 'begin_suppress_tokens': [220, 50257]}

TrainOutput(global_step=375, training_loss=0.733264892578125, metrics={'train_runtime': 2155.8891, 'train_samples_per_second': 5.566, 'train_steps_per_second': 0.174, 'total_flos': 3.448259424e+17, 'train_loss': 0.733264892578125, 'epoch': 3.0})
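
The numbers in the training output can be cross-checked with a little arithmetic. This sketch assumes training on a single device, which the notebook does not state explicitly, and assumes that `logging_steps` was left at its default of 500 in `TrainingArguments`. If either assumption is off for your setup, the figures will differ.

```python
import math

train_rows = 4000             # size of the training split
batch_size = 32               # per_device_train_batch_size
grad_accum_steps = 1          # gradient_accumulation_steps
num_epochs = 3                # num_train_epochs
assumed_logging_steps = 500   # assumption: default logging interval

steps_per_epoch = math.ceil(train_rows / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

# The training loss is only reported once the step count reaches the logging interval.
training_loss_is_logged = total_steps >= assumed_logging_steps

print(steps_per_epoch, total_steps, training_loss_is_logged)  # 125 375 False
```

This matches `global_step=375` and the `checkpoint-125` directory name, and it would explain why the Training Loss column shows "No log": the run ends before the first logging step. Passing a smaller `logging_steps` value to `TrainingArguments` should make the training loss appear in the table.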

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/JohnJumon/pronunciation_accuracy/commit/355885b07b49a3ea437606ea9088bfbb458f945d', commit_message='End of training', commit_description='', oid='355885b07b49a3ea437606ea9088bfbb458f945d', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
training_args = TrainingArguments(
    output_dir="model/fluency_accuracy",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=encoded_fluency_model,
    args=training_args,
    train_dataset=encoded_pronunciation_fluency["train"],
    eval_dataset=encoded_pronunciation_fluency["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | No log | 0.466444 | 0.814 |
| 2 | No log | 0.425002 | 0.823 |
| 3 | No log | 0.52184 | 0.827 |



TrainOutput(global_step=375, training_loss=0.397497802734375, metrics={'train_runtime': 2015.1566, 'train_samples_per_second': 5.955, 'train_steps_per_second': 0.186, 'total_flos': 3.4482150144e+17, 'train_loss': 0.397497802734375, 'epoch': 3.0})

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/JohnJumon/fluency_accuracy/commit/7ea5001997ba6668908be2ebbdb6908b07a1ce88', commit_message='End of training', commit_description='', oid='7ea5001997ba6668908be2ebbdb6908b07a1ce88', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
training_args = TrainingArguments(
    output_dir="model/prosodic_accuracy",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=encoded_prosodic_model,
    args=training_args,
    train_dataset=encoded_pronunciation_prosodic["train"],
    eval_dataset=encoded_pronunciation_prosodic["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | No log | 0.768442 | 0.67 |
| 2 | No log | 0.675357 | 0.728 |
| 3 | No log | 0.719434 | 0.726 |



TrainOutput(global_step=375, training_loss=0.612763916015625, metrics={'train_runtime': 2030.8283, 'train_samples_per_second': 5.909, 'train_steps_per_second': 0.185, 'total_flos': 3.448259424e+17, 'train_loss': 0.612763916015625, 'epoch': 3.0})

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/JohnJumon/prosodic_accuracy/commit/63c027b213918ff97735fee44ec9d11dd9ef18a7', commit_message='End of training', commit_description='', oid='63c027b213918ff97735fee44ec9d11dd9ef18a7', pr_url=None, pr_revision=None, pr_num=None)

### Inference

In [None]:
from transformers import pipeline
accuracy_classifier = pipeline("audio-classification", model="JohnJumon/pronunciation_accuracy")
fluency_classifier = pipeline("audio-classification", model="JohnJumon/fluency_accuracy")
prosodic_classifier = pipeline("audio-classification", model="JohnJumon/prosodic_accuracy")


In [None]:
def pronunciation_scoring(audio):
  accuracy = accuracy_classifier(audio)
  fluency = fluency_classifier(audio)
  prosodic = prosodic_classifier(audio)
  result = {
      'accuracy': accuracy,
      'fluency': fluency,
      'prosodic': prosodic
      }
  for category, scores in result.items():
    max_score_label = max(scores, key=lambda x: x['score'])['label']
    result[category] = max_score_label
  return result

In [None]:
audio = '/content/audio.wav'

In [None]:
pronunciation_scoring(audio)

{'accuracy': 'Excellent', 'fluency': 'Fluent', 'prosodic': 'Perfect'}
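
The function above relies on each classifier returning a list of dictionaries with a `label` and a `score`. The sketch below shows the label-selection step on its own, using made-up scores instead of a real model call; `top_label` and the sample values are hypothetical.

```python
def top_label(scores):
    """Return the label of the entry with the highest score."""
    return max(scores, key=lambda item: item["score"])["label"]


# Made-up classifier output in the label/score format used above.
sample_scores = [
    {"score": 0.20, "label": "Excellent"},
    {"score": 0.71, "label": "Good"},
    {"score": 0.09, "label": "Average"},
]

print(top_label(sample_scores))  # Good
```

Taking the maximum explicitly means the result does not depend on the order in which the classifier lists its labels.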

In [None]:
import gradio as gr
from transformers import pipeline
import numpy as np

accuracy_classifier = pipeline(task="audio-classification", model="JohnJumon/pronunciation_accuracy")
fluency_classifier = pipeline(task="audio-classification", model="JohnJumon/fluency_accuracy")
prosodic_classifier = pipeline(task="audio-classification", model="JohnJumon/prosodic_accuracy")

def pronunciation_scoring(audio):
  accuracy_description = {
      'Extremely Poor': 'Extremely poor pronunciation and only one or two words are recognizable',
      'Poor': 'Poor, clumsy and rigid pronunciation of the sentence as a whole, with serious pronunciation mistakes',
      'Average': 'The overall pronunciation of the sentence is understandable, with many pronunciation mistakes and accent, but it does not affect the understanding of basic meanings',
      'Good': 'The overall pronunciation of the sentence is good, with a few pronunciation mistakes',
      'Excellent': 'The overall pronunciation of the sentence is excellent, with accurate phonology and no obvious pronunciation mistakes'
    }
  fluency_description = {
      'Very Influent': 'Intermittent, very influent speech, with lots of pauses, repetition, and stammering',
      'Influent': 'The speech is a little influent, with many pauses, repetition, and stammering',
      'Average': 'Fluent in general, with a few pauses, repetition, and stammering',
      'Fluent': 'Fluent without noticeable pauses or stammering'
    }
  prosodic_description = {
      'Poor': 'Poor intonation and lots of stammering and pauses, unable to read a complete sentence',
      'Unstable': 'Unstable speech speed, speak too fast or too slow, without the sense of rhythm',
      'Stable': 'Unstable speech speed, many stammering and pauses with a poor sense of rhythm',
      'Almost': 'Nearly correct intonation at a stable speaking speed, nearly smooth and coherent, but with little stammering and few pauses',
      'Perfect': 'Correct intonation at a stable speaking speed, speak with cadence, and can speak like a native'
    }
  accuracy = accuracy_classifier(audio)
  fluency = fluency_classifier(audio)
  prosodic = prosodic_classifier(audio)
  result = {
      'accuracy': accuracy,
      'fluency': fluency,
      'prosodic': prosodic
      }
  for category, scores in result.items():
    max_score_label = max(scores, key=lambda x: x['score'])['label']
    result[category] = max_score_label
  return result['accuracy'], accuracy_description[result['accuracy']], result['fluency'], fluency_description[result['fluency']], result['prosodic'], prosodic_description[result['prosodic']]

gradio_app = gr.Interface(
    pronunciation_scoring,
    inputs=gr.Audio(type="filepath"),
    outputs=[
        gr.Label(label="Accuracy Result"),
        gr.Textbox(interactive=False, show_label=False),
        gr.Label(label="Fluency Result"),
        gr.Textbox(interactive=False, show_label=False),
        gr.Label(label="Prosodic Result"),
        gr.Textbox(interactive=False, show_label=False)
      ],
    title="Pronunciation Scoring",
    description="This app will score your pronunciation accuracy, fluency, and prosodic (intonation)"
)

gradio_app.launch(debug=True)


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://d5944ec1900ff5933c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


## Submit Notebook

In [None]:
portfolio_link = "https://huggingface.co/spaces/JohnJumon/pronunciation-scoring"
presentation_link = "https://drive.google.com/drive/u/0/folders/1p4HXI_HdLl7xiAe0tbMwasLi1iQ1AMFk"

question_id = "01_portfolio_link"
submit(student_id, name, assignment_id, str(portfolio_link), question_id, drive_link)

question_id = "02_presentation_link"
submit(student_id, name, assignment_id, str(presentation_link), question_id, drive_link)

'Assignment successfully submitted'

# FIN