<a href="https://colab.research.google.com/github/dhruvchopra2003/NLP/blob/main/speech_to_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
# Task 1

In [10]:
! pip install -q transformers

In [11]:
import librosa 
import torch
import nltk
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
import soundfile as sf

In [13]:
def load_wav2vec_960h_model():
  """
  Returns the pretrained tokenizer and CTC model for facebook/wav2vec2-base-960h
  """
  tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
  model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
  return tokenizer, model

def correct_uppercase_sentence(input_text):
  """
  Returns the text with the first letter of every sentence capitalized
  """
  sentences = nltk.sent_tokenize(input_text)
  return ' '.join(s[0].upper() + s[1:] for s in sentences)

In [14]:
def asr_transcript(tokenizer, model, input_file):
  """
  Returns the transcript of the input audio recording

  Input: Hugging Face tokenizer, model and path to a wav file
  Output: Transcribed text
  """
  #read the file
  speech, samplerate = sf.read(input_file)
  #make it 1-D by averaging the channels (summing them can clip)
  if speech.ndim > 1:
      speech = speech.mean(axis=1)
  #resample to 16 kHz, the rate the model was trained on
  if samplerate != 16000:
      speech = librosa.resample(speech, orig_sr=samplerate, target_sr=16000)
  #tokenize
  input_values = tokenizer(speech, return_tensors="pt").input_values
  #take logits; no gradients are needed for inference
  with torch.no_grad():
      logits = model(input_values).logits
  #take argmax (most probable token id for each audio frame)
  predicted_ids = torch.argmax(logits, dim=-1)
  #decode the predicted token ids into text
  transcription = tokenizer.decode(predicted_ids[0])
  #output is all uppercase, so lowercase it and capitalize the first letter of each sentence
  transcription = correct_uppercase_sentence(transcription.lower())
  return transcription
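
The argmax and decode steps in `asr_transcript` amount to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary `toy_vocab`, the ids in `frame_ids`, and the helper name `ctc_greedy_decode` are hypothetical and chosen for illustration; they are not the real wav2vec2 vocabulary or the library's own implementation.

```python
from itertools import groupby

def ctc_greedy_decode(ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the remaining ids to characters."""
    # Merge runs of identical consecutive ids into a single id
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # Remove the blank token, which only separates repeated characters
    chars = [id_to_char[i] for i in collapsed if i != blank_id]
    # The word delimiter token stands for a space between words
    return "".join(chars).replace(word_delimiter, " ").strip()

# Hypothetical toy vocabulary (assumption: id 0 is the blank, "|" separates words)
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "O"}
# One predicted id per audio frame, as torch.argmax would produce
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 2, 4, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

Because repeats are merged before blanks are removed, a blank between two identical ids is what lets a doubled letter survive decoding.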

In [15]:
#load model and tokenizer 
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
wav_input = 'sales_call_telephone_marketers.wav'
#reuse the tokenizer and model loaded in the previous cell instead of loading them again
text = asr_transcript(tokenizer,model,wav_input)
print(text)

In [17]:
# Task 2

In [72]:
import pandas as pd
df = pd.read_csv("/content/NLP.csv")
df

,Intent,Example Sentence,Entities names
0,Intro,My name is Jeff and I am calling from Amazon.,caller_name: Jeff company: Amazon
1,Intro,I am calling from Microsoft and my name is Satya.,caller_name: Satya company: Microsoft
2,Intro,I am Sundar and I am calling from Google,caller_name: Sundar company: Google
3,Purpose,I am calling about your Microsoft Azure subscr...,product: Microsoft Azure
4,Purpose,This is call regarding your google cloud platf...,product: google cloud platform
5,Purpose,I would like to talk about your amazon web ser...,product: amazon web services


In [73]:
# removing duplicates
df = df.drop_duplicates(keep='last')

In [74]:
df

,Intent,Example Sentence,Entities names
0,Intro,My name is Jeff and I am calling from Amazon.,caller_name: Jeff company: Amazon
1,Intro,I am calling from Microsoft and my name is Satya.,caller_name: Satya company: Microsoft
2,Intro,I am Sundar and I am calling from Google,caller_name: Sundar company: Google
3,Purpose,I am calling about your Microsoft Azure subscr...,product: Microsoft Azure
4,Purpose,This is call regarding your google cloud platf...,product: google cloud platform
5,Purpose,I would like to talk about your amazon web ser...,product: amazon web services


In [75]:
# inspecting one example sentence before applying bag of words (CountVectorizer)
df['Example Sentence'][3]

'I am calling about your Microsoft Azure subscription'

In [76]:
df['Intent'][3]

'Purpose'

In [77]:
x = df['Example Sentence'].values
y = df['Intent'].values

In [78]:
from sklearn.model_selection import train_test_split
x_train,x_test, y_train, y_test = train_test_split(x,y,random_state=0)

In [79]:
x_train.shape

(4,)

In [80]:
x_test.shape

(2,)

In [81]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [82]:
vect = CountVectorizer(stop_words = 'english')
x_train_vect = vect.fit_transform(x_train)
x_test_vect = vect.transform(x_test)

In [83]:
x_train_vect.toarray() # dense view of the sparse word-count matrix (one row per sentence, one column per vocabulary word)

array([[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]])
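
To make the count matrix above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea: build a sorted vocabulary, then count how often each vocabulary word occurs in each sentence. It mimics the default tokenization of `CountVectorizer` (lowercasing, tokens of two or more word characters) but applies no stop-word removal, so its columns will not match the array above. The helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, count_matrix) for a list of sentences."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    # The vocabulary is the sorted set of all tokens; its order defines the columns
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per sentence, one count per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, counts = bag_of_words([
    "I am Sundar and I am calling from Google",
    "My name is Jeff and I am calling from Amazon.",
])
print(vocab)
for row in counts:
    print(row)
```

Note that the single-character token "I" is dropped by the tokenization rule, and that word order is lost: only the counts remain.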

In [84]:
# using pipelines: SVC + CountVectorizer
from sklearn.svm import SVC

In [85]:
from sklearn.pipeline import make_pipeline
model2 = make_pipeline(CountVectorizer(), SVC())
model2.fit(x_train, y_train)
y_pred2 = model2.predict(x_test)
y_pred2

array(['Purpose', 'Intro'], dtype=object)

In [86]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred2)

1.0

In [87]:
# separate name so the Wav2Vec2 `model` from Task 1 is not overwritten
svc_model = SVC()
svc_model.fit(x_train_vect, y_train)

SVC()

In [88]:
y_pred = svc_model.predict(x_test_vect)
y_pred

array(['Intro', 'Intro'], dtype=object)

In [89]:
accuracy_score(y_test, y_pred)

0.5

In [90]:
import joblib
joblib.dump(model2, 'Intent')

['Intent']

In [91]:
import joblib
reload_model = joblib.load('Intent')

In [92]:
reload_model.predict(['I am calling from microsoft My name is Satya'])

array(['Intro'], dtype=object)

In [93]:
reload_model.predict(['This is regarding your Google cloud platform'])

array(['Purpose'], dtype=object)

In [94]:
reload_model.predict(["I wanted to tell you about airtel's new plan"])

array(['Purpose'], dtype=object)

In [95]:
reload_model.predict(['My name is Dhruv from microsoft'])

array(['Intro'], dtype=object)

In [42]:
# training a similar model that predicts the entity label string of a sentence

In [98]:
a = df['Example Sentence'].values
b = df['Entities names'].values

In [99]:
from sklearn.model_selection import train_test_split
a_train, a_test, b_train, b_test = train_test_split(a,b, random_state = 0)

In [100]:
a_train.shape

(4,)

In [101]:
a_test.shape

(2,)

In [102]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [103]:
vect = CountVectorizer(stop_words = 'english')
a_train_vect = vect.fit_transform(a_train)
a_test_vect = vect.transform(a_test)  # transform only: the vocabulary must come from the training data

In [104]:
a_train_vect.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]])

In [105]:
from sklearn.pipeline import make_pipeline
# separate name so the intent pipeline `model2` is not overwritten
entity_model = make_pipeline(CountVectorizer(), SVC())
entity_model.fit(a_train, b_train)
b_pred2 = entity_model.predict(a_test)
b_pred2

array(['product: Microsoft Azure',
       'caller_name: Satya company: Microsoft'], dtype=object)

In [106]:
accuracy_score(b_test, b_pred2)

0.0
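
An accuracy of 0.0 is expected here: each value in `Entities names` is used as a class label, and every label in this dataset is unique, so a label held out for testing never appears in training. One possible alternative, sketched below under that assumption, is to parse the label string into key/value pairs and locate each value in the sentence, which is the kind of target a sequence-labeling model would need. The helper names `parse_entities` and `entity_spans` are hypothetical.

```python
import re

# A key is a run of word characters followed by ':'; its value runs
# until the next key or the end of the string.
ENTITY_PATTERN = re.compile(r"(\w+):\s*(.*?)(?=\s+\w+:|$)")

def parse_entities(label):
    """Turn 'key: value key: value' into a dict of entity name -> value."""
    return {key: value for key, value in ENTITY_PATTERN.findall(label)}

def entity_spans(sentence, label):
    """Map each entity name to the (start, end) character span of its value."""
    spans = {}
    for key, value in parse_entities(label).items():
        start = sentence.find(value)
        if start != -1:  # skip values that do not occur verbatim in the sentence
            spans[key] = (start, start + len(value))
    return spans

sentence = "My name is Jeff and I am calling from Amazon."
label = "caller_name: Jeff company: Amazon"
print(parse_entities(label))
print(entity_spans(sentence, label))
```

The span lookup is case-sensitive and only finds exact matches, so it is a starting point for building training targets rather than an entity extractor in its own right.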

In [107]:
from sklearn.naive_bayes import MultinomialNB
model3 = MultinomialNB()

In [108]:
model3.fit(a_train_vect, b_train)

MultinomialNB()

In [109]:
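# x_test_vect holds the same test sentences as a_test (same split and random_state),
# encoded with the vocabulary learned from the training sentences
b_pred3 = model3.predict(x_test_vect)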
b_pred3

array(['caller_name: Jeff company: Amazon',
       'caller_name: Jeff company: Amazon'], dtype='<U37')