1\. Creating transcription helper functions
-------------------------------------------

00:00 - 00:45

You've come a long way. From exploring an audio file from scratch to manipulating audio files to working with different transcription APIs. In this chapter, you're going to be putting everything you've learned together by building a proof of concept spoken language processing pipeline. Acme Studios, a technology company, has approached you to use your speech processing skills to gain insights on their customer support calls. They've sent you a handful of audio samples to explore and to see what you can find. They let you know they're not quite sure of the quality of the files or the format they're recorded in.

2\. Exploring audio files
-------------------------

00:45 - 01:07

You open the folder of audio files Acme have sent through using the os module's listdir function and notice they're in the mp3 format. You've seen this before but before continuing you decide to write down a list of things you're going to do to prepare for building the proof of concept.

```python
# Import os module
import os

# Check the folder of audio files
os.listdir("acme_audio_files")

# List of audio file names
['call_1.mp3', 'call_2.mp3', 'call_3.mp3', 'call_4.mp3']
```

3\. Preparing for the proof of concept
--------------------------------------

01:07 - 01:37

The first thing will be to listen to a few of the files using your media player or PyDub's play function to get an understanding of what you're working with, and then to transcribe one as soon as possible using recognize google so you have a baseline to work off. You convert the first file to wav and transcribe but you know from previous work, doing this for every file is tedious.

```python
# Import PyDub's AudioSegment and speech_recognition as sr
from pydub import AudioSegment
import speech_recognition as sr

# Import call 1 and convert to .wav
call_1 = AudioSegment.from_file("acme_audio_files/call_1.mp3")
call_1.export("acme_audio_files/call_1.wav", format="wav")

# Transcribe call 1
recognizer = sr.Recognizer()
call_1_file = sr.AudioFile("acme_audio_files/call_1.wav")

with call_1_file as source:
    call_1_audio = recognizer.record(source)

# Print the transcription so the baseline is visible
print(recognizer.recognize_google(call_1_audio))
```

4\. Functions we'll create
--------------------------

01:37 - 01:54

You decide it's a good idea to create functions which will help you for the rest of the proof of concept. One to convert files to wav format, one to find stats of an audio file using PyDub and another to transcribe an audio file using recognize google.

### Convert non-.wav files to .wav format
`convert_to_wav()` converts non-.wav files to .wav files.

### Show audio file attributes
`show_pydub_stats()` displays the audio attributes of a .wav file.

### Transcribe audio
`transcribe_audio()` uses `recognize_google()` to transcribe a .wav file.

5\. Creating a file format conversion function
----------------------------------------------

01:54 - 02:24

The first one convert to wav takes a file pathname and converts the file to a wav file. You'll first import the file as an AudioSegment, then create a new file name for it using the split function on the filename and adding the dot wav string extension. Finally, you'll use the export function to export it to wav format with the new file name, similar to what you did in a previous lesson.

```python
def convert_to_wav(filename):
    """Takes an audio file of non .wav format and converts to .wav"""
    
    # Import audio file
    audio = AudioSegment.from_file(filename)
    
    # Create new filename
    new_filename = filename.split(".")[0] + ".wav"
    
    # Export file as .wav
    audio.export(new_filename, format="wav")
    
    print(f"Converting {filename} to {new_filename}...")
```

This Python function `convert_to_wav()` takes an audio file of a non-.wav format and converts it to a .wav file. Here's how it works:

1. The `AudioSegment.from_file()` function is used to import the audio file.
2. A new filename is created by taking the original filename, splitting it on the "." and adding ".wav" to the end.
3. The `audio.export()` method is used to export the audio to the new .wav file format.
4. A print statement is included to show the conversion progress.

This function can be called with the filename of the audio file you want to convert to .wav format.
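One thing to watch in step 2: `filename.split(".")[0]` keeps only the text before the first dot, so a path with a dot in a folder name or elsewhere in the file name gets truncated. A minimal sketch of an alternative, assuming you want to swap only the final extension, uses the standard library's `os.path.splitext()`. The helper name `wav_filename` and the second example path are hypothetical and not part of the course code.

```python
import os

def wav_filename(filename):
    """Return filename with only its final extension replaced by .wav."""
    # splitext() splits at the last dot of the last path component,
    # so dots in folder names or earlier in the file name are kept.
    root, _extension = os.path.splitext(filename)
    return root + ".wav"

print(wav_filename("acme_audio_files/call_1.mp3"))
print(wav_filename("acme.audio/call_1.v2.mp3"))

# For comparison, the split(".")[0] approach on the second path
print("acme.audio/call_1.v2.mp3".split(".")[0] + ".wav")
```

For the paths used in this chapter both approaches give the same result, so this only matters if your own file names contain extra dots.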

6\. Using the file format conversion function
---------------------------------------------

02:24 - 02:36

Great, now you can convert audio files without repeating yourself. Now let's make one to find an audio file's attributes using PyDub.

```python
convert_to_wav("acme_audio_files/call_1.mp3")
```
Converting acme_audio_files/call_1.mp3 to acme_audio_files/call_1.wav...

This code calls the `convert_to_wav()` function with the file path `"acme_audio_files/call_1.mp3"` as the argument. It converts the `call_1.mp3` audio file located in the `acme_audio_files` directory to a `.wav` format file.

After running this code, the converted .wav file will be available in the same directory as the original .mp3 file.


7\. Creating an attribute showing function
------------------------------------------

02:36 - 02:51

show pydub stats takes a filename of an audio file and imports it as an AudioSegment. It then prints a number of attributes such as number of channels, sample width, frame rate and more.

```python
def show_pydub_stats(filename):
    """Prints different audio attributes related to an audio file."""

    audio_segment = AudioSegment.from_file(filename)

    print(f"Channels: {audio_segment.channels}")
    print(f"Sample width: {audio_segment.sample_width}")
    print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
    print(f"Frame width: {audio_segment.frame_width}")
    print(f"Length (ms): {len(audio_segment)}")
    print(f"Frame count: {audio_segment.frame_count()}")
```

This Python function `show_pydub_stats()` takes an audio file path as input and prints various attributes of the audio file, including:

- Number of channels
- Sample width
- Frame rate (sample rate)
- Frame width
- Length in milliseconds
- Frame count

It creates an `AudioSegment` instance from the input file and then accesses and prints the relevant attributes of the audio file.

8\. Using the attribute showing function
----------------------------------------

02:51 - 03:07

Since you're working with customer support calls, this will help especially with files with different numbers of channels. If there are two channels, you might be able to split them and transcribe each speaker separately.

```python
show_pydub_stats("acme_audio_files/call_1.wav")
```

This code calls the `show_pydub_stats()` function with the file path `"acme_audio_files/call_1.wav"` as the argument. It will print various audio attributes related to the `call_1.wav` audio file, including:

- Channels: 2
- Sample width: 2 
- Frame rate (sample rate): 32000
- Frame width: 4
- Length (ms): 54888
- Frame count: 1756416.0
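
These attributes are not independent of each other. As a quick sanity check, the sketch below recomputes frame width and frame count from the other values reported for `call_1.wav`, assuming the usual PCM relationships (frame width is channels times sample width, and frame count is frame rate times duration in seconds). The variable names are for illustration only.

```python
# Values reported for call_1.wav
channels = 2
sample_width = 2      # bytes per sample
frame_rate = 32000    # frames per second
length_ms = 54888     # duration in milliseconds

# One frame holds one sample for every channel
frame_width = channels * sample_width

# Total frames = frames per second * duration in seconds
frame_count = frame_rate * length_ms / 1000

print(f"Frame width: {frame_width}")
print(f"Frame count: {frame_count}")
```

If the numbers you compute this way disagree with what PyDub reports, it is worth checking whether the file was read correctly before transcribing it.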

9\. Creating a transcribe function
----------------------------------

03:07 - 03:33

Finally, since you could be transcribing many audio files, you create a function to transcribe an audio file. transcribe audio takes a file path of an audio file and creates a speech recognition recognizer instance. It transcribes the audio file using recognize google as you've done in a previous lesson and returns the transcribed text.

```python
def transcribe_audio(filename):
    """Takes a .wav format audio file and transcribes it to text."""

    recognizer = sr.Recognizer()
    audio_file = sr.AudioFile(filename)

    with audio_file as source:
        audio_data = recognizer.record(source)
        return recognizer.recognize_google(audio_data)
```

This function `transcribe_audio()` takes a `.wav` format audio file and uses the `recognize_google()` method from the `speech_recognition` library to transcribe the audio into text. Here's how it works:

1. It creates a `Recognizer` instance to perform the speech recognition.
2. It loads the audio file using `sr.AudioFile()`.
3. It records the audio data from the file using `recognizer.record()`.
4. It then passes the audio data to `recognizer.recognize_google()` to transcribe the audio to text.
5. The transcribed text is returned as the output of the function.

To use this function, you can call it with the path to a `.wav` audio file:

```python
transcribed_text = transcribe_audio("acme_audio_files/call_1.wav")
print(transcribed_text)
```

This will print the transcribed text of the `call_1.wav` audio file.

10\. Using the transcribe function
----------------------------------

03:33 - 03:51

Testing out the function on one of the calls works as expected. It reads in an audio file and returns the transcribed text. Excellent. Setting up helper functions like this at the start of a project may seem time-consuming but they'll help save time in the long run.

```python
def transcribe_audio(filename):
    """Takes a .wav format audio file and transcribes it to text."""

    recognizer = sr.Recognizer()
    audio_file = sr.AudioFile(filename)

    with audio_file as source:
        audio_data = recognizer.record(source)
        return recognizer.recognize_google(audio_data)
```

`"hello welcome to Acme studio support line my name is Daniel how can I best help you hey Daniel this is John I've recently bought a smart from you guys and I know that's not good to hear John let's let's get your cell number and then we can we can set up a way to fix it for you one number for 1757 varies how long do you reckon this is going to take about an hour now while John we're going to try our best hour I will we get the sealing member will start up this support case I'm just really really really I've been trying to contact 34 been put on hold more than an hour and a half so I'm not really happy I kind of wanna get this issue 6 is fossil"`

The `transcribe_audio()` function takes a `.wav` format audio file, loads it using `sr.AudioFile()`, records the audio data, and then uses the `recognize_google()` method to transcribe the audio to text. The transcribed text is returned as the output.

To use this function, you can call it with the path to a `.wav` audio file:

```python
transcribed_text = transcribe_audio("acme_audio_files/call_1.wav")
print(transcribed_text)
```

This will print the transcribed text of the `call_1.wav` audio file.

11\. Let's practice!
--------------------

03:51 - 04:05

With that said, it's time to build them! Once you've got these ready to go, you'll be able to use some of your natural language processing skills on the transcribed text.

Converting audio to the right format
====================================

Acme Studios have asked you to do a proof of concept to find out more about their audio files.

After exploring them briefly, you find there are a few calls but they're in the wrong file format for transcription.

As you'll be interacting with many audio files, you decide to begin by creating some helper functions.

The first one, `convert_to_wav(filename)`, takes a file path and uses `PyDub` to convert it from a non-wav format to `.wav` format.

Once it's built, we'll use the function to convert [Acme's first call](https://assets.datacamp.com/production/repositories/4637/datasets/83ef1650407e911a0f52f491068e3082661db743/ex4_call_1_stereo_mp3.mp3), `call_1.mp3`, from `.mp3` format to `.wav`.

`PyDub`'s `AudioSegment` class has already been imported. Remember, to work with non-wav files, you'll need `ffmpeg` ([docs](https://www.ffmpeg.org/)).

Instructions
------------

-   Import the `filename` parameter using `AudioSegment`'s `from_file()`.
-   Set the export format to `"wav"`.
-   Pass the target audio file, `call_1.mp3`, to the function.

In [None]:
# Create function to convert audio file to wav
def convert_to_wav(filename):
  """Takes an audio file of non .wav format and converts to .wav"""
  # Import audio file
  audio = AudioSegment.from_file(filename)
  
  # Create new filename
  new_filename = filename.split(".")[0] + ".wav"
  
  # Export file as .wav
  audio.export(new_filename, format='wav')
  print(f"Converting {filename} to {new_filename}...")
 
# Test the function
convert_to_wav("call_1.mp3")

Finding PyDub stats
===================

You decide it'll be helpful to know the audio attributes of any given file easily. This will be especially helpful for finding out how many channels an audio file has or if the frame rate is adequate for transcription.

In this exercise, we'll create `show_pydub_stats()` which takes a filename of an audio file as input. It then imports the audio as a `PyDub` `AudioSegment` instance and prints attributes such as number of channels, length and more.

It then returns the `AudioSegment` instance so it can be used later on.

We'll use our function on the [newly converted .wav file](https://assets.datacamp.com/production/repositories/4637/datasets/43c5aff8c419d07f8cef70fdf40e4657b78b70be/ex4_call_1_stereo_formatted.wav), `call_1.wav`

`AudioSegment` has already been imported from `PyDub`.

Instructions
------------

-   Create an `AudioSegment` instance called `audio_segment` by importing the `filename` parameter.
-   Print the number of channels using the `channels` attribute.
-   Return the `audio_segment` variable.
-   Test the function on `"call_1.wav"`.

In [None]:
def show_pydub_stats(filename):
  """Returns different audio attributes related to an audio file."""
  # Create AudioSegment instance
  audio_segment = AudioSegment.from_file(filename)
  
  # Print audio attributes and return AudioSegment instance
  print(f"Channels: {audio_segment.channels}")
  print(f"Sample width: {audio_segment.sample_width}")
  print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
  print(f"Frame width: {audio_segment.frame_width}")
  print(f"Length (ms): {len(audio_segment)}")
  return audio_segment

# Try the function
call_1_audio_segment = show_pydub_stats("call_1.wav")
# output:
#     Channels: 2
#     Sample width: 2
#     Frame rate (sample rate): 32000
#     Frame width: 4
#     Length (ms): 54888

Transcribing audio with one line
================================

Alright, now you've got functions to convert audio files and find out their attributes, it's time to build one to transcribe them.

In this exercise, you'll build `transcribe_audio()` which takes a `filename` as input, imports the `filename` using `speech_recognition`'s `AudioFile` class and then transcribes it using `recognize_google()`.

You've seen these functions before but now we'll put them together so they're accessible in a function.

To test it out, we'll transcribe [Acme's first call](https://assets.datacamp.com/production/repositories/4637/datasets/43c5aff8c419d07f8cef70fdf40e4657b78b70be/ex4_call_1_stereo_formatted.wav), `"call_1.wav"`.

`speech_recognition` has been imported as `sr`.

Instructions
------------

-   Define a function called `transcribe_audio` which takes `filename` as an input parameter.
-   Set up a `Recognizer()` instance as `recognizer`.
-   Use `recognize_google()` to transcribe the audio data.
-   Pass the target call to the function.

In [None]:
def transcribe_audio(filename):
  """Takes a .wav format audio file and transcribes it to text."""
  # Setup a recognizer instance
  recognizer = sr.Recognizer()
  
  # Import the audio file and convert to audio data
  audio_file = sr.AudioFile(filename)
  with audio_file as source:
    audio_data = recognizer.record(source)
  
  # Return the transcribed text
  return recognizer.recognize_google(audio_data)

# Test the function
print(transcribe_audio("call_1.wav"))
# output:
#     hello welcome to Acme studio support line my name is Daniel how can 
# I best help you hey Daniel this is John I've recently bought a smart from 
# you guys 3 weeks ago and I'm already having issues with it I know that's not 
# good to hear John let's let's get your cell number and then we can we can set up 
# a way to fix it for you one number for 17 varies how long do you reckon this is going 
# to try our best to get the steel number will start up this support case I'm just really 
# really really really I've been trying to contact past three 4 days now and I've been put 
# on hold more than an hour and a half so I'm not really happy I kind of wanna get this issue 6 is f***** possible


Using the helper functions you've built
=======================================

Okay, now we've got some helper functions ready to go, it's time to put them to use!

You'll first use `convert_to_wav()` to convert Acme's `call_1.mp3` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/56f523fb855eaecc14a87c5619ec5e6e7c4490bc/ex4_call_1_stereo_formatted_mp3.mp3)) to `.wav` format and save it as `call_1.wav`

Using `show_pydub_stats()` you find `call_1.wav` has 2 channels so you decide to split them using `PyDub`'s `split_to_mono()`. Acme tells you the [customer channel](https://assets.datacamp.com/production/repositories/4637/datasets/03ace2e9b866aaa554c465d6698500aaf48599dc/ex4_call_1_channel_2_split.wav) is likely channel 2. So you export channel 2 using `PyDub`'s `.export()`.

Finally, you'll use `transcribe_audio()` to transcribe channel 2 only.
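
To get a feel for what splitting a stereo file involves, here is a small sketch of the idea at the sample level. It assumes 16-bit stereo PCM, where samples are stored interleaved (left, right, left, right, and so on). The sample values are made up, and this illustrates the concept rather than PyDub's actual implementation.

```python
from array import array

# Hypothetical 16-bit stereo samples, interleaved as ch1, ch2, ch1, ch2, ...
interleaved = array("h", [10, -10, 20, -20, 30, -30])

# Every second sample, starting at 0, belongs to channel 1
channel_1 = interleaved[0::2]

# Every second sample, starting at 1, belongs to channel 2
channel_2 = interleaved[1::2]

print(list(channel_1))
print(list(channel_2))
```

This is also why the customer channel, channel 2, is accessed with index `1` on the list returned by `split_to_mono()`: Python lists are zero-indexed.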

Instructions 1/3
----------------

-   Convert the `.mp3` version of `call_1` to `.wav` and then check the stats of the `.wav` version.

In [None]:
# Convert mp3 file to wav
convert_to_wav("call_1.mp3")

# Check the stats of new file
call_1 = show_pydub_stats("call_1.wav")

Instructions 2/3
----------------

-   Split `call_1` to mono and then export the second channel in `.wav` format.

In [None]:
# Convert mp3 file to wav
convert_to_wav("call_1.mp3")

# Check the stats of new file
call_1 = show_pydub_stats("call_1.wav")

# Split call_1 to mono
call_1_split = call_1.split_to_mono()

# Export channel 2 (the customer channel)
call_1_split[1].export("call_1_channel_2.wav",
                       format="wav")

Instructions 3/3
----------------

-   Transcribe the audio of call 1's channel 2.

In [None]:
# Convert mp3 file to wav
convert_to_wav("call_1.mp3")

# Check the stats of new file
call_1 = show_pydub_stats("call_1.wav")

# Split call_1 to mono
call_1_split = call_1.split_to_mono()

# Export channel 2 (the customer channel)
call_1_split[1].export("call_1_channel_2.wav",
                       format="wav")

# Transcribe the single channel
print(transcribe_audio("call_1_channel_2.wav"))

1\. Sentiment analysis on spoken language text
----------------------------------------------

00:00 - 00:32

Now you've got some helper functions ready, it's time to start extracting information from the transcribed text. Your proposal to Acme Studios suggested sentiment analysis, the process of figuring out if text is positive, neutral or negative, would be helpful and they agreed. Knowing the sentiment of different calls may help them figure out where customers are having the most trouble. To do sentiment analysis, you decide on using the NLTK Python library.

2\. Installing sentiment analysis libraries
-------------------------------------------

00:32 - 01:17

To begin, you install NLTK using pip. Then you download the necessary NLTK packages for sentiment analysis, punkt and vader lexicon, using NLTK's download function. Since we don't have a large enough dataset to train our own sentiment analysis model, we'll use NLTK's VADER, or Valence Aware Dictionary and sEntiment Reasoner, as it has a pretrained sentiment analysis model in it. VADER works by analyzing each word in a piece of text and giving it a sentiment score. It was pretrained on social media text passages but will lend itself well to our proof of concept.

```python
import nltk
nltk.download("punkt")
nltk.download("vader_lexicon")
```

3\. Sentiment analysis with VADER
---------------------------------

01:17 - 02:17

To start sentiment analysis, you import the SentimentIntensityAnalyzer class from the nltk sentiment vader module. And then instantiate an instance of SentimentIntensityAnalyzer and save it to the commonly named variable sid. You can then find the sentiment scores of a piece of text by calling polarity scores on sid and passing it a string. Running the function will return four values, neg for negative, neu for neutral, pos for positive and compound as an overall. The more negative a piece of text is, the higher the negative score will be and the same goes for the positive score if the text is positive. If it's in the middle, neutral will be higher. And the compound value can be thought of as the overall score with -1 being most negative and positive 1 being most positive.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create sentiment analysis instance
sid = SentimentIntensityAnalyzer()

# Test sentiment analysis on negative text
print(sid.polarity_scores("This customer service is terrible."))
```

Output:
```
{'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound': -0.4767}
```
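
When reporting results, a single label is often easier to read than four numbers. The sketch below turns the compound score into a label, assuming the commonly used cut-off of 0.05 either side of zero. The helper name `label_sentiment` and the `threshold` parameter are hypothetical, and the cut-off may need tuning for call transcripts.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER polarity_scores() dict to a coarse sentiment label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    # Anything close to zero is treated as neutral
    return "neutral"

# Scores returned for "This customer service is terrible."
terrible_scores = {'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound': -0.4767}
print(label_sentiment(terrible_scores))
```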

4\. Sentiment analysis on transcribed text
------------------------------------------

02:17 - 02:44

You try out the sentiment analysis on one of your transcribed phone calls using only the customer channel. Reading the transcription and comparing it to what you hear when you listen to the audio file, you can see it's not perfect. But you can see the sentiment scores are leaning in the right direction. The sentiment is fairly neutral since the customer hasn't received their product yet.

```python
# Transcribe customer channel of call_3
call_3_channel_2_text = transcribe_audio("call_3_channel_2.wav")
print(call_3_channel_2_text)

# Sentiment analysis on customer channel of call_3
print(sid.polarity_scores(call_3_channel_2_text))
```

Output:
```
"hey Dave is this any better do I order products are currently on July 1st and I haven't received the product a three-week step down this parable 6987 5"

{'neg': 0.0, 'neu': 0.892, 'pos': 0.108, 'compound': 0.4404}
```

5\. Sentence by sentence
------------------------

02:44 - 03:22

From your experience with sentiment analysis, you know the sentiment can change sentence by sentence. But your current transcription function doesn't return sentences, only a large block of text. In your proposal, you mentioned this to Acme and they allocated budget for you to try a paid transcription API. You try transcribing the same audio files using a paid API service and find it returns sentences. Using NLTK's sent tokenize, you break the transcription into sentences and analyze the sentiment sentence by sentence.

```python
call_3_paid_api_text = "Okay. Yeah. Hi, Diane. This is paid on this call and obvi..."

# Import sent tokenizer
from nltk.tokenize import sent_tokenize
# Find sentiment on each sentence
for sentence in sent_tokenize(call_3_paid_api_text):
    print(sentence)
    print(sid.polarity_scores(sentence))
```

6\. Sentence by sentence
------------------------

03:22 - 03:36

This is helpful because it allows you to figure out which parts of the conversation the customer may be most displeased with. You can see the line where the transcription says that service is terrible gets a negative compound score.

```
Okay.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.2263}

Yeah.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.296}

Hi, Diane.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

This is paid on this call and obviously the status of my orders at three weeks ago, and that service is terrible.
{'neg': 0.129, 'neu': 0.871, 'pos': 0.0, 'compound': -0.4767}

Is this any better?
{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

Yes...
```
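
Once each sentence has a score, picking out the low point of a call can be automated. The sketch below stores the sentences and compound scores from the output above in a list and finds the most negative one. The variable names are for illustration only, and only the compound value is kept for brevity.

```python
# (sentence, compound score) pairs copied from the output above
scored_sentences = [
    ("Okay.", 0.2263),
    ("Yeah.", 0.296),
    ("Hi, Diane.", 0.0),
    ("This is paid on this call and obviously the status of my orders "
     "at three weeks ago, and that service is terrible.", -0.4767),
    ("Is this any better?", 0.4404),
]

# The sentence with the lowest compound score is the most negative
most_negative = min(scored_sentences, key=lambda pair: pair[1])

print(most_negative[0])
print(most_negative[1])
```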

7\. Time to code!
-----------------

03:36 - 03:44

It's still early, but you're starting to see some insights you can report back to Acme. Let's code!

Analyzing sentiment of a phone call
===================================

Once you've transcribed the text from an audio file, it's possible to perform natural language processing on the text.

In this exercise, we'll use `NLTK`'s VADER (Valence Aware Dictionary and sEntiment Reasoner) to analyze the sentiment of the transcribed text of `call_2.wav` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/82c77dc404e914eb08ce2a54a10603ef027711b8/ex4_call_2_stereo_native.wav)).

To transcribe the text, we'll use the `transcribe_audio()` function we created earlier.

Once we have the text, we'll use `NLTK`'s `SentimentIntensityAnalyzer()` class to obtain a sentiment polarity score.

`.polarity_scores(text)` returns a value for pos (positive), neu (neutral), neg (negative) and compound. Compound is an overall score between -1 (most negative) and 1 (most positive). The higher it is, the more positive the text. Lower means more negative.

Instructions
------------

-   Instantiate an instance of `SentimentIntensityAnalyzer()` and save it to the variable `sid`.
-   Transcribe the target call and save it to `call_2_text`.
-   Print the `polarity_scores()` of `call_2_text`.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create SentimentIntensityAnalyzer instance
sid = SentimentIntensityAnalyzer()

# Let's try it on one of our phone calls
call_2_text = transcribe_audio('call_2.wav')

# Display text and sentiment polarity scores
print(call_2_text)
print(sid.polarity_scores(call_2_text))

Sentiment analysis on formatted text
====================================

In this exercise, you'll calculate the sentiment on the customer channel of `call_2.wav` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/82c77dc404e914eb08ce2a54a10603ef027711b8/ex4_call_2_stereo_native.wav)).

You've split the customer channel and saved it to `call_2_channel_2.wav` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/bc1fa0595fda765634de7b09864a26566b5f11db/ex4_call_2_channel_2_formatted.wav)).

But from your experience with sentiment analysis, you know it can change sentence to sentence.

To calculate it sentence to sentence, you split the text using `NLTK`'s `sent_tokenize()` function.

But `transcribe_audio()` doesn't return sentences. To try sentiment analysis with sentences, you've tried a paid API service to get `call_2_channel_2_paid_api_text` which has sentences.

Instructions 1/3
----------------

-   Transcribe the audio of `call_2_channel_2.wav` and find the sentiment scores.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create SentimentIntensityAnalyzer instance
sid = SentimentIntensityAnalyzer()

# Transcribe customer channel of call 2
call_2_channel_2_text = transcribe_audio('call_2_channel_2.wav')

# Display text and sentiment polarity scores
print(call_2_channel_2_text)
print(sid.polarity_scores(call_2_channel_2_text))

Instructions 2/3
----------------

-   Split `call_2_channel_2_text` into sentences and find the sentiment score of each sentence.

In [None]:
# Import sent_tokenize from nltk
from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create SentimentIntensityAnalyzer instance
sid = SentimentIntensityAnalyzer()

# Split call 2 channel 2 into sentences and score each
for sentence in sent_tokenize(call_2_channel_2_text):
    print(sentence)
    print(sid.polarity_scores(sentence))
# output:
#     oh hi Daniel my name is Sally I recently purchased a smartphone from you guys and extremely happy with it I've just gotta issue not an issue but I've just got to learn a little bit more about the message bank on I have Google the location but I'm I'm finding it hard I thought you were on the corner of Edward and Elizabeth according to Google according to the match but would you be able to help me in some way because I think I've actually walk straight past your shop
#     {'neg': 0.017, 'neu': 0.891, 'pos': 0.091, 'compound': 0.778} 

Instructions 3/3
----------------

-   Split `call_2_channel_2_paid_api_text` into sentences and score the sentiment of each.

In [None]:
# Import sent_tokenize from nltk
from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create SentimentIntensityAnalyzer instance
sid = SentimentIntensityAnalyzer()

# Split channel 2 paid text into sentences and score each
for sentence in sent_tokenize(call_2_channel_2_paid_api_text):
    print(sentence)
    print(sid.polarity_scores(sentence))

1\. Named entity recognition on transcribed text
------------------------------------------------

00:00 - 00:21

Now you've done some sentiment analysis on Acme's transcribed calls, you decide named entity recognition is a good next step. Entity recognition is the process of extracting objects of interest from text. To do this, you turn to spaCy, the natural language processing library.

2\. Installing spaCy
--------------------

00:21 - 00:37

To get started with spaCy, you can install it using pip. Once spaCy is installed, we can use spaCy's built-in language models for natural language processing by downloading them using the spacy download command on the command line.

```
# Install spaCy
$ pip install spacy

# Download spaCy language model  
$ python -m spacy download en_core_web_sm
```

3\. Using spaCy
---------------

00:37 - 01:13

spaCy works by turning blocks of text into docs. Docs are made up of tokens and spans. You can think of tokens as individual words and groups of tokens or sentences as spans. Let's see. First we import spacy. Then we load the language model and save it to the conventional variable nlp. Then to create a spaCy doc, we pass the string of text we want to use to nlp. Now we've got a spaCy doc, we can use spaCy's built-in features to find out more.

```python
import spacy

# Load spaCy language model
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc
doc = nlp("I'd like to talk about a smartphone I ordered on July 31st from your Sydney store, my order number is 4093829. I spoke to one of your customer service team, Georgia, yesterday.")
```

4\. spaCy tokens
----------------

01:13 - 01:29

You can see what tokens a doc contains and the index where they start using dot text and dot idx on objects in your doc. The number returned by idx indicates the index of the first letter in the token.

```python
# Show different tokens and positions
for token in doc:
    print(token.text, token.idx)
```

```
I 0
'd 1
like 4 
to 9
talk 12
about 17
a 23
smartphone 25
```
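
Because `idx` is a character offset into the original string, you can check the values above with plain Python slicing, no spaCy required. A small sketch, using the start of the example text and the token and offset pairs from the output above:

```python
text = "I'd like to talk about a smartphone I ordered on July 31st"

# (token text, idx) pairs copied from the output above
token_offsets = [
    ("I", 0), ("'d", 1), ("like", 4), ("to", 9),
    ("talk", 12), ("about", 17), ("a", 23), ("smartphone", 25),
]

# Slicing the text at each offset should give back the token text
offsets_match = all(
    text[idx:idx + len(token_text)] == token_text
    for token_text, idx in token_offsets
)

print(offsets_match)
```

Note that spaCy splits `I'd` into two tokens, `I` and `'d`, which is why the second token starts at offset 1.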

5\. spaCy sentences
-------------------

01:29 - 01:37

You can see where the sentences are with dot sents. Here spaCy has broken the text in our doc into sentences.

```python
# Show sentences in doc
for sentence in doc.sents:
    print(sentence)
```

```
I'd like to talk about a smartphone I ordered on July 31st from your Sydney store, my order number is 4093829.
I spoke to one of your customer service team, Georgia, yesterday.
```

6\. spaCy named entities
------------------------

01:37 - 01:55

Beautiful, now let's try using spaCy's named entity recognition. A named entity is an object which is given a name, such as, a person, product, location or date. spaCy has several of these named entities built-in it can recognize straight away.

```markdown
Some of spaCy's built-in named entities:

- PERSON People, including fictional.
- ORG Companies, agencies, institutions, etc. 
- GPE Countries, cities, states.
- PRODUCT Objects, vehicles, foods, etc. (Not services.)
- DATE Absolute or relative dates or periods.
- TIME Times smaller than a day.
- MONEY Monetary values, including unit.
- CARDINAL Numerals that do not fall under another type.
```

7\. spaCy named entities
------------------------

01:55 - 02:15

You can access the named entities in a doc using dot ents. Let's try. dot text shows us the token that the label belongs to. And dot label underscore gives us the named entity label of the text. You can see Sydney is given GPE for geopolitical entity.

```python
# Find named entities in doc
for entity in doc.ents:
    print(entity.text, entity.label_)
```

```
July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE
```
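Once you have text and label pairs like these, it can be handy to group them by label. Here is a minimal sketch in plain Python: the `pairs` list is copied from the output above rather than produced by spaCy, and `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the printed output
pairs = [("July 31st", "DATE"), ("Sydney", "GPE"), ("4093829", "CARDINAL"),
         ("one", "CARDINAL"), ("Georgia", "GPE"), ("yesterday", "DATE")]

def group_by_label(pairs):
    """Collect entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_by_label(pairs)
print(entities_by_label)
```

With a real doc you could build `pairs` from `doc.ents` using `entity.text` and `entity.label_`. Grouping this way also makes mislabels easier to spot, such as Georgia landing under GPE rather than PERSON.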

8\. Custom named entities
-------------------------

02:15 - 02:56

spaCy's built-in named entities are excellent but depending on your problem, you'll probably want to develop some of your own. Since Acme is a technology company, you decide it's a good idea to create a custom entity recognizer for their products. To do so, you can use spaCy's pipeline class EntityRuler. A pipeline is what spaCy uses to parse text into a doc. You can see the current pipeline you're using by calling pipeline on nlp. In our case, our pipeline has three steps, a tagger, a parser and ner for named entity recognition.

```python
# Import EntityRuler class
from spacy.pipeline import EntityRuler

# Check spaCy pipeline  
print(nlp.pipeline)
```

```
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c3aa8a470>),
('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3bb60588>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3bb605e8>)]
```

9\. Changing the pipeline
-------------------------

02:56 - 03:29

The EntityRuler class allows us to create another step in the pipeline. We start by making an instance of EntityRuler called ruler, passing it nlp. Then we use add patterns to add the token pattern we'd like spaCy to consider an entity. In our case, we want the smartphone token to have the entity label PRODUCT. We can add this rule to the pipeline before ner so we can be sure it gets used.

```python
# Create EntityRuler instance
ruler = EntityRuler(nlp)

# Add token pattern to ruler
ruler.add_patterns([{"label":"PRODUCT", "pattern": "smartphone"}])

# Add new rule to pipeline before ner
nlp.add_pipe(ruler, before="ner")

# Check updated pipeline
nlp.pipeline
```

```
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c3aa8a470>),
('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3bb60588>),
('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x1c3bb605a8>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3bb605e8>)]
```

10\. Changing the pipeline
--------------------------

03:29 - 03:34

Now when we check our pipeline we've got a new step called entity ruler.

```
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c1f9c9b38>),
('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3c9cba08>),
('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x1c1d834b70>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3c9cba68>)]
```

11\. Testing the new pipeline
-----------------------------

03:34 - 03:42

Let's try it with our doc from before. You can see the token smartphone now has the PRODUCT named entity label.

```python
# Re-process the text so the updated pipeline is applied
doc = nlp(doc.text)

# Test new entity rule
for entity in doc.ents:
    print(entity.text, entity.label_)
```

```
smartphone PRODUCT  
July 31st DATE
Sydney GPE
4093829 CARDINAL 
one CARDINAL
Georgia GPE
yesterday DATE
```
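The slides above use the spaCy v2 style of creating an EntityRuler instance and passing it to add pipe. In spaCy v3, `nlp.add_pipe()` takes the component's string name and returns the component instead. The sketch below assumes spaCy v3 and uses a blank English pipeline so it runs without a downloaded model, which means only the rule-based entity shows up. With a loaded model such as `en_core_web_sm` you would likely pass `before="ner"` as well. The example sentence is shortened from the one used earlier.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline has no statistical ner step,
# so the only entities found are the ones the ruler defines.
nlp = spacy.blank("en")

# In v3, add_pipe takes the component name and returns the component
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "smartphone"}])

# Entities are assigned when the text is processed, so create the doc
# after the pipeline has been updated
doc = nlp("I'd like to talk about a smartphone I ordered on July 31st.")
found = [(entity.text, entity.label_) for entity in doc.ents]
print(found)
```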

12\. Let's rocket and practice spaCy!
-------------------------------------

03:42 - 03:47

Woah, we covered a lot of ground in this lesson. Let's make it happen!

Named entity recognition in spaCy
=================================

Named entities are real-world objects which have names, such as cities, people, dates or times. We can use `spaCy` to find named entities in our transcribed text.

In this exercise, you'll transcribe `call_4_channel_2.wav` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/2e039462d95117677db6ddfe24377d9cadcdf730/ex4_call_4_channel_2_formatted.wav)) using `transcribe_audio()` and then use `spaCy`'s language model, `en_core_web_sm` to convert the transcribed text to a `spaCy` doc.

Transforming text to a `spaCy` doc allows us to leverage `spaCy`'s built-in features for analyzing text, such as `.text` for tokens (single words), `.sents` for sentences and `.ents` for named entities.

Instructions 1/4
----------------

-   Create a `spaCy` `doc` by passing the transcribed call 4 channel 2 text to `nlp()` and then check its type.

In [None]:
import spacy

# Transcribe call 4 channel 2
call_4_channel_2_text = transcribe_audio("call_4_channel_2.wav")

# Create a spaCy language model instance
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc with call 4 channel 2 text
doc = nlp(call_4_channel_2_text)

# Check the type of doc
print(type(doc))

Instructions 2/4
----------------

-   Create a `spaCy` `doc` with `call_4_channel_2_text` then print all the token text in it using the `.text` attribute.

In [None]:
import spacy

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc with call 4 channel 2 text
doc = nlp(call_4_channel_2_text)

# Show tokens in doc
for token in doc:
    print(token.text, token.idx)

Instructions 3/4
----------------

-   Load the `"en_core_web_sm"` language model and then print the sentences in the `doc` using the `.sents` attribute.

In [None]:
import spacy

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc with call 4 channel 2 text
doc = nlp(call_4_channel_2_text)

# Show sentences in doc
for sentence in doc.sents:
    print(sentence)

Instructions 4/4
----------------

-   Access the entities in the doc using `.ents` and then print the text of each.

In [None]:
import spacy

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc with call 4 channel 2 text
doc = nlp(call_4_channel_2_text)

# Show named entities and their labels
for entity in doc.ents:
    print(entity.text, entity.label_)

Creating a custom named entity in spaCy
=======================================

If `spaCy`'s built-in named entities aren't enough, you can make your own using `spaCy`'s `EntityRuler()` class.

`EntityRuler()` allows you to create your own entities to add to a `spaCy` pipeline.

You start by creating an instance of `EntityRuler()` and passing it the current pipeline, `nlp`.

You can then call `add_patterns()` on the instance and pass it a list of dictionaries, each pairing an entity `label` with the text `pattern` you'd like to tag.

Once you've set up a pattern, you can add it to the `nlp` pipeline using `add_pipe()`.

Since Acme is a technology company, you decide to tag the pattern `"smartphone"` with the `"PRODUCT"` entity tag.

`spaCy` has been imported and a `doc` already exists containing the transcribed text from the `call_4_channel_2.wav` file.

Instructions
------------

-   Import `EntityRuler` from `spacy.pipeline`.
-   Add `"smartphone"` as the value for the `"pattern"` key.
-   Add the `EntityRuler()` instance, `ruler`, to the `nlp` pipeline.
-   Print the entity attributes contained in `doc`.

In [None]:

# Import EntityRuler class
from spacy.pipeline import EntityRuler

# Create EntityRuler instance
ruler = EntityRuler(nlp)

# Define pattern for new entity
ruler.add_patterns([{"label": "PRODUCT", "pattern": "smartphone"}])

# Update existing pipeline
nlp.add_pipe(ruler, before="ner")

# Re-process the text so the updated pipeline is applied
doc = nlp(doc.text)

# Test new entity
for entity in doc.ents:
  print(entity.text, entity.label_)

1\. Classifying transcribed speech with Sklearn
-----------------------------------------------

00:00 - 00:32

Acme are impressed with your work so far and have sent over two folders full of phone call audio snippets. And they've manually labelled them with pre-purchase if the customer was calling before a purchase or post-purchase if the customer was calling after making a purchase. They said the process of labeling audio files was labor intensive and want to know if machine learning can help. You immediately start to think of building an sklearn text classifier, and that's what we'll be doing in this lesson.

2\. Inspecting the data
-----------------------

00:32 - 00:47

You inspect the folders by importing os and using the listdir function on the folder path. You notice there's about 50 files in each but they're in the mp3 format. Luckily you built a function to handle this earlier.

```python
# Inspect post purchase audio folder
import os

post_purchase_audio = os.listdir("post_purchase")
print(post_purchase_audio[:5])
```

```
['post-purchase-audio-0.mp3',
 'post-purchase-audio-1.mp3', 
 'post-purchase-audio-2.mp3',
 'post-purchase-audio-3.mp3',
 'post-purchase-audio-4.mp3']
```

3\. Converting to wav
---------------------

00:47 - 00:53

Using your convert to wav function you built earlier, you convert all the files from mp3 to wav.

```python
# Loop through mp3 files
for file in post_purchase_audio:
    print(f"Converting {file} to .wav...")
    # Use previously made function to convert to .wav
    convert_to_wav(file)
```

```
Converting post-purchase-audio-0.mp3 to .wav...
Converting post-purchase-audio-1.mp3 to .wav...
Converting post-purchase-audio-2.mp3 to .wav...
Converting post-purchase-audio-3.mp3 to .wav...
Converting post-purchase-audio-4.mp3 to .wav...
```
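One thing to keep in mind is that `os.listdir` returns bare file names without the folder. Whether `convert_to_wav` expects a full path depends on how you wrote it earlier, which isn't shown here. If it does need one, a sketch like the following builds the source and target paths. `wav_path_for` is a hypothetical helper, not part of the course code.

```python
from pathlib import Path

def wav_path_for(folder, filename):
    """Return the .wav path that sits next to an audio file in `folder`."""
    return Path(folder) / Path(filename).with_suffix(".wav").name

# File name taken from the listing above
source = Path("post_purchase") / "post-purchase-audio-0.mp3"
target = wav_path_for("post_purchase", "post-purchase-audio-0.mp3")
print(source, "->", target)
```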

4\. Transcribing all phone call excerpts
----------------------------------------

00:53 - 01:25

Excellent, now they're all in wav format, you decide to create a function, create text list, to transcribe all of the files in a folder to text. You start with an empty list, then looping through the folder, if a file ends with a wav extension, you pass the filepath to your transcribe audio function which returns the text. Once you have the text, you append it to your empty list and then return the list full of transcribed text.

```python
# Transcribe text from wav files
def create_text_list(folder):
    text_list = []
    # Loop through folder
    for file in folder:
        # Check for .wav extension
        if file.endswith(".wav"):
            # Transcribe audio
            text = transcribe_audio(file)
            # Add transcribed text to list
            text_list.append(text)
    return text_list
```
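To check the filtering logic without any audio files, here is a sketch of a variant that takes the transcription function as an argument. `fake_transcribe` is a made-up stand-in for `transcribe_audio`, and the file names are invented for the example.

```python
def create_text_list(filenames, transcribe):
    """Transcribe every .wav file name in `filenames` using `transcribe`."""
    return [transcribe(name) for name in filenames if name.endswith(".wav")]

# Hypothetical stand-in so the sketch runs without real audio
def fake_transcribe(filename):
    return f"transcript of {filename}"

# The .mp3 entry should be skipped by the extension check
files = ["call_1.wav", "call_1.mp3", "call_2.wav"]
texts = create_text_list(files, fake_transcribe)
print(texts)
```

Passing the transcriber in like this is only one design option. It makes it easy to swap in a different transcription function later without changing the loop.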

5\. Transcribing all phone call excerpts
----------------------------------------

01:25 - 01:34

Running the function on the post purchase folder, returns a list of text. Let's see what the first five look like.

```python
# Convert post purchase audio to text
post_purchase_text = create_text_list(post_purchase_audio)
print(post_purchase_text[:5])
```

```
['hey man I just water product from you guys and I think is amazing but I leave a li',
'these clothes I just bought from you guys too small is there anyway I can change t',
'I recently got these pair of shoes but they\'re too big can I change the size',
'I bought a pair of pants from you guys but they\'re way too small',
'I bought a pair of pants and they\'re the wrong colour is there any chance I can ch']
```

6\. Organizing transcribed text
-------------------------------

01:34 - 02:11

Okay, we're making progress. Those helper functions came in handy. To make building your text classifier easier, you decide to put all the text into a pandas dataframe. You start by importing pandas as pd. Then create a post purchase dataframe by passing pd DataFrame a dictionary with a key named label which has a value of post purchase and a text key whose value is the text list. You do the same for the pre purchase text. And to have everything in one place, you combine the two dataframes with pd dot concat. Let's see it.

```python
import pandas as pd

# Create post purchase dataframe
post_purchase_df = pd.DataFrame({"label": "post_purchase", "text": post_purchase_text})

# Create pre purchase dataframe
pre_purchase_df = pd.DataFrame({"label": "pre_purchase", "text": pre_purchase_text})

# Combine pre purchase and post purchase
df = pd.concat([post_purchase_df, pre_purchase_df])

# View the combined dataframe 
df.head()
```

7\. Organizing transcribed text
-------------------------------

02:11 - 02:19

Beautiful! Now you've got your data in a dataframe, you can use it to build a text classifier with sklearn.

```
             label                                               text
0  post_purchase  yeah hello someone this morning delivered a pa...
1  post_purchase  my shipment arrived yesterday but it's not the...
2  post_purchase  hey my name is Daniel I received my shipment y...
3  post_purchase  hey mate how are you doing I'm just calling in...
4   pre_purchase  hey I was wondering if you know where my new p...
```
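Two details of this step are easy to miss: the single label string is repeated for every row of the text list, and `pd.concat` keeps each dataframe's original index unless told otherwise. The sketch below uses tiny made-up lists to show both. Passing `ignore_index=True` is an optional choice, not something the slides use.

```python
import pandas as pd

# Made-up stand-ins for the transcribed text lists
post_purchase_text = ["my shipment arrived damaged", "these shoes are too big"]
pre_purchase_text = ["do you ship overseas"]

# The scalar label is broadcast to every row of the list
post_purchase_df = pd.DataFrame({"label": "post_purchase", "text": post_purchase_text})
pre_purchase_df = pd.DataFrame({"label": "pre_purchase", "text": pre_purchase_text})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
df = pd.concat([post_purchase_df, pre_purchase_df], ignore_index=True)
print(df)
```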

8\. Building a text classifier
------------------------------

02:19 - 03:00

We'll start by importing the necessary packages: NumPy as np, Pipeline from sklearn's pipeline module, MultinomialNB from sklearn's naive bayes module for our classifier, CountVectorizer and TfidfTransformer from sklearn's text feature extraction module to transform our text into numbers, and train test split to split our data into training and test sets. To start, we'll use train test split to split the data using a test size of 30%, where our X value is the text column and our y value is the label column of the dataframe we created earlier.

```python
# Import text classification packages
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.3)
```
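With a dataset this small, the split can change noticeably from run to run. As a sketch, assuming you want a repeatable split that keeps both labels represented, you could add `random_state` and `stratify`. The texts and labels below are made up, and neither argument is used in the slides.

```python
from sklearn.model_selection import train_test_split

# Made-up data: five examples per class
texts = ([f"post purchase call {i}" for i in range(5)]
         + [f"pre purchase call {i}" for i in range(5)])
labels = ["post_purchase"] * 5 + ["pre_purchase"] * 5

# Features and labels are passed positionally; stratify keeps the
# class balance similar in both splits, random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

print(len(X_train), len(X_test))
```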

9\. Naive Bayes Pipeline
------------------------

03:00 - 03:28

Next, you set up a classifier pipeline called text classifier, which uses CountVectorizer and TfidfTransformer to transform each of the text samples into numerical values that depend on the words they contain. Then MultinomialNB builds a naive Bayes model to classify each sample. To train the model you call the fit function on your text classifier and pass it the training data.

```python
# Create text classifier pipeline 
text_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB())
])

# Fit the classifier pipeline on the training data
text_classifier.fit(X_train, y_train)
```
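To see what the first two pipeline steps produce, here is a small sketch that runs them on their own with a made-up three-sentence corpus. CountVectorizer builds one column per word in the vocabulary, and TfidfTransformer reweights those counts without changing the shape.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus for illustration
corpus = ["my order arrived", "where is my order", "order arrived late"]

# Step 1: word counts, one row per sample and one column per word
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Step 2: reweight counts so words common to every sample matter less
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))
print(counts.shape, tfidf.shape)
```

Inside the Pipeline these same steps run automatically whenever you call fit or predict, so you pass raw text in and never handle the matrices yourself.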

10\. Not so Naive
-----------------

03:28 - 03:47

Once you've got a trained model, you can evaluate it by calling the predict function on your classifier and passing it the test set data. Then you can use Numpy to compare the predictions to the test data labels. That's not a bad model! Not so Naive after all.

```python
# Make predictions and compare them to test labels
predictions = text_classifier.predict(X_test)
accuracy = 100 * np.mean(predictions == y_test)
print(f"The model is {accuracy:.2f}% accurate.")
```

```
The model is 97.87% accurate.
```
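The accuracy calculation works because comparing predictions to labels gives an array of True and False values, and the mean of that array is the fraction of matches. Here is a worked sketch with made-up predictions and labels; the numbers are illustrative only and do not come from the Acme data.

```python
import numpy as np

# Made-up predictions and true labels, four samples
predictions = np.array(["post_purchase", "pre_purchase", "pre_purchase", "post_purchase"])
true_labels = np.array(["post_purchase", "pre_purchase", "post_purchase", "post_purchase"])

# Element-wise comparison: True where the prediction matches the label
matches = predictions == true_labels

# True counts as 1 and False as 0, so the mean is the share of correct predictions
accuracy = 100 * np.mean(matches)
print(f"The model is {accuracy:.2f}% accurate.")
```

Three of the four predictions match, so this prints 75.00%.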

11\. Let's practice!
--------------------

03:47 - 03:54

Alright, you've seen enough, time to get this model into Acme's hands! Let's code.

Preparing audio files for text classification
=============================================

Acme are very impressed with your work so far. So they've sent over two more folders of audio files.

One folder is called `pre_purchase` and contains audio snippets from customers who are pre-purchase, like `pre_purchase_audio_25.mp3` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/2acd3f72cd3753f200fae1479d7c06f2ea70cf7d/pre-purchase-audio-25.wav)). 

And the other is called `post_purchase` and contains audio snippets from customers who have made a purchase (post-purchase), like `post_purchase_audio_27.mp3` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/30c755abc91782decd347c0b7c3b2c9ab86751a0/post-purchase-audio-27.wav)).

Upon inspecting the files you find there's about 50 in each and they're in the `.mp3` format.

Acme want to know if you can build a classifier to classify future calls. You tell them you sure can.

So in this exercise, you'll go through each folder and convert the audio files to `.wav` format using `convert_to_wav()` so you can transcribe them.

In [None]:
# Convert post purchase
for file in post_purchase:
    print(f"Converting {file} to .wav...")
    convert_to_wav(file)

# Convert pre purchase
for file in pre_purchase:
    print(f"Converting {file} to .wav...")
    convert_to_wav(file)

Transcribing phone call excerpts
================================

In this exercise, we'll transcribe the audio files we converted to `.wav` format to text using `transcribe_audio()`.

Since there's lots of them and there could be more, we'll build a function `create_text_list()` which takes a list of filenames of audio files as input and goes through each file transcribing the text.

`create_text_list()` uses our `transcribe_audio()` function we created earlier and returns a list of strings containing the transcribed text from each audio file.

`pre_purchase_wav_files` and `post_purchase_wav_files` are lists of audio snippet filenames.

Instructions 1/2
----------------

-   Use `transcribe_audio()` to transcribe the current `file` to text and add it to the text list.
-   Return the text list.

In [None]:
def create_text_list(folder):
  # Create empty list
  text_list = []
  
  # Go through each file
  for file in folder:
    # Make sure the file is .wav
    if file.endswith(".wav"):
      print(f"Transcribing file: {file}...")
      
      # Transcribe audio and append text to list
      text_list.append(transcribe_audio(file))   
  return text_list

create_text_list(folder)

Instructions 2/2
----------------

-   Use `create_text_list()` to transcribe all post and pre purchase audio snippets.
-   Check the first transcription of the post purchase text list.

In [None]:
post_purchase_text = create_text_list(post_purchase_wav_files)
pre_purchase_text = create_text_list(pre_purchase_wav_files)

# Inspect the first transcription of post purchase
print(post_purchase_text[0])

Organizing transcribed phone call data
======================================

We're almost ready to build a text classifier. But right now, all of our transcribed text data is in two lists, `pre_purchase_text` and `post_purchase_text`.

To organize it better for building a text classifier as well as for future use, we'll put it together into a pandas DataFrame.

To start we'll import `pandas` as `pd` then we'll create a post purchase dataframe, `post_purchase_df` using `pd.DataFrame()`.

We'll pass `pd.DataFrame()` a dictionary containing a `"label"` key with a value of `"post_purchase"` and a `"text"` key with a value of our `post_purchase_text` list.

We'll do the same for `pre_purchase_df` except with `pre_purchase_text`.

To have all the data in one place, we'll use `pd.concat()` and pass it the pre and post purchase DataFrames.

Instructions
------------

-   Create `post_purchase_df` using the `post_purchase_text` list.
-   Create `pre_purchase_df` using the `pre_purchase_text` list.
-   Combine the two DataFrames using `pd.concat()`.

In [None]:
import pandas as pd

# Make dataframes with the text
post_purchase_df = pd.DataFrame({"label": "post_purchase",
                                 "text": post_purchase_text})
pre_purchase_df = pd.DataFrame({"label": "pre_purchase",
                                "text": pre_purchase_text})

# Combine DataFrames
df = pd.concat([post_purchase_df, pre_purchase_df])

# Print the combined DataFrame
print(df.head())

Create a spoken language text classifier
========================================

Now you've transcribed some customer call audio data, we'll build a model to classify whether the text from the customer call is `pre_purchase` or `post_purchase`.

We've got 45 examples of `pre_purchase` calls and 57 examples of `post_purchase` calls.

The data the model will train on is stored in `train_df` and the data the model will predict on is stored in `test_df`.

Try printing the `.head()` of each of these to the console.

We'll build an `sklearn` pipeline using `CountVectorizer()` and `TfidfTransformer()` to convert our text samples to numbers and then use a `MultinomialNB()` classifier to learn what category each sample belongs to.

This model will work well on our small example here but for larger amounts of text, you may want to consider something more sophisticated.

Instructions 1/2
----------------

-   Create `text_classifier` using `CountVectorizer()`, `TfidfTransformer()`, and `MultinomialNB()`.
-   Fit `text_classifier` on `train_df.text` and `train_df.label`.

In [None]:
# Build the text_classifier as an sklearn pipeline
text_classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Fit the classifier pipeline on the training data
text_classifier.fit(train_df.text, train_df.label)

# Evaluate the MultinomialNB model
predicted = text_classifier.predict(test_df.text)
accuracy = 100 * np.mean(predicted == test_df.label)
print(f'The model is {accuracy}% accurate')

Instructions 2/2
----------------

-   Create `predicted` by calling `predict()` on `text_classifier` and passing it the text column of `test_df`.
-   Evaluate the model by seeing how `predicted` compares to the `test_df.label`.

In [None]:
# Build the text_classifier as an sklearn pipeline
text_classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Fit the classifier pipeline on the training data
text_classifier.fit(train_df.text, train_df.label)

# Evaluate the MultinomialNB model
predicted = text_classifier.predict(test_df.text)
accuracy = 100 * np.mean(predicted == test_df.label)
print(f'The model is {accuracy}% accurate')