<a href="https://colab.research.google.com/github/camilleoconn/QTM350Contadina/blob/master/Grammar%20Check%20Walk-through.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Counting Grammar Mistakes for Native versus Non-Native Speakers

Proceed to these steps once you have created the translated JSON files for the written texts you would like to compare. You will have created them by placing your input JSON files in an S3 bucket or in your local directory, and then running this command in the shell:

```
aws translate translate-text \
            --region region \
            --cli-input-json file://translate.json > translated.json
```
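For reference, the `translate.json` file passed via `--cli-input-json` holds the request parameters for `translate-text`: `Text`, `SourceLanguageCode`, and `TargetLanguageCode`. Below is a minimal sketch for generating that file from Python. The helper name `write_translate_request`, the sample sentence, and the `it`/`en` language codes are assumptions for illustration, not part of the original workflow.

```python
import json
from pathlib import Path


def write_translate_request(text, source_lang, target_lang, path="translate.json"):
    """Write the request file that `aws translate translate-text` reads (hypothetical helper)."""
    request = {
        "Text": text,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCode": target_lang,
    }
    # ensure_ascii=False keeps accented characters readable in the file
    Path(path).write_text(
        json.dumps(request, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return request


if __name__ == "__main__":
    # Assumed example: an Italian sentence to be translated into English
    write_translate_request("Ciao, come stai?", "it", "en")
```

The response saved to `translated.json` then contains the `TranslatedText` field that the grammar script reads later on.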



### A word on naming files

For ease, give the native speaker's .json files a prefix that distinguishes them from the .json files of the non-native (learner) speaker. For example, use `n_chinese.json` for the native speaker and `l_chinese.json` for the learner.


### Now to begin 
We will be using a Python wrapper for the open-source grammar tool [LanguageTool](https://predictivehacks.com/languagetool-grammar-and-spell-checker-in-python/) called [language_tool_python](https://pypi.org/project/language-tool-python/). According to its documentation, this library allows you to detect grammar errors and spelling mistakes through a Python script or through a command-line interface.

In the shell, run this command to install the wrapper in your Python environment.

```
$ pip install language_tool_python
```
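Once the wrapper is installed, a check comes down to calling `check(text)` on a `LanguageTool` object, which returns a list with one match per detected issue. The sketch below shows the counting logic in isolation. `count_issues` is a hypothetical helper, and `StandInChecker` is only a toy substitute so the example runs without LanguageTool; it is not how the real tool detects errors.

```python
def count_issues(text, checker):
    """Return (issue_count, word_count) for `text`.

    `checker` is any object with a .check(text) method that returns a list,
    such as language_tool_python.LanguageTool('en-US').
    """
    matches = checker.check(text)   # one call for the whole text
    words = text.split()            # split on any run of whitespace
    return len(matches), len(words)


class StandInChecker:
    """Toy stand-in (assumption for illustration): flags the token 'teh'."""

    def check(self, text):
        return [word for word in text.split() if word == "teh"]


issues, words = count_issues("teh cat sat on the mat", StandInChecker())
print(issues, words)
```

With the real library, pass `language_tool_python.LanguageTool('en-US')` as `checker` instead of the stand-in.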

### S3 buckets to store files

Next we make two S3 buckets into which we will copy over the translated .json files from our local directory so that we can compile the data for processing.

The first bucket will contain the translated .json files from the native speaker and the second will contain the translated .json files from the non-native speaker (which we will call "learner"). Bucket names must be globally unique, so replace the names after `s3://` with your own.

```
$ aws s3 mb s3://trans-native
$ aws s3 mb s3://trans-learner
```

For additional guidance, refer to [this AWS CLI user guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html) for creating buckets. This step can also be done in the console.




### Now from your current working directory, we will copy over the native speaker and learner .json files into their respective buckets.

First, for the native speaker:

```
# Include all .json files with the "n_*.json" format to be copied in bucket
$ aws s3 cp $(pwd) s3://trans-native/ --recursive --exclude "*" --include "n_*.json"
```

Next, for the learner speaker:

```
# Include all .json files with the "l_*.json" format to be copied in bucket
$ aws s3 cp $(pwd) s3://trans-learner/ --recursive --exclude "*" --include "l_*.json"
```
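The `--exclude "*" --include "n_*.json"` pair works because the AWS CLI applies filters in the order given, with later filters taking precedence over earlier ones: everything is excluded first, then the matching files are added back. The sketch below emulates that rule with Python's `fnmatch` so you can preview which files a pattern would select. `select_files` and the sample file names are assumptions for illustration, not part of the AWS CLI.

```python
from fnmatch import fnmatchcase


def select_files(names, filters):
    """Emulate ordered --exclude/--include filters (hypothetical helper).

    `filters` is a list of ("exclude" | "include", pattern) pairs in
    command-line order; the last matching filter decides.
    """
    selected = []
    for name in names:
        keep = True  # files are included by default
        for action, pattern in filters:
            if fnmatchcase(name, pattern):
                keep = (action == "include")  # later filters override earlier ones
        if keep:
            selected.append(name)
    return selected


names = ["n_chinese.json", "l_chinese.json", "notes.txt", "translate.json"]
native = select_files(names, [("exclude", "*"), ("include", "n_*.json")])
learner = select_files(names, [("exclude", "*"), ("include", "l_*.json")])
print(native)   # ['n_chinese.json']
print(learner)  # ['l_chinese.json']
```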




### Great, now that we have our files stored in buckets, we can use a Python script that reads the specified files in both buckets and compares the total count of grammar mistakes to the word count.

Open a text file and paste in the following Python code. Title this script `grammarmistake.py`. This script can also be found on GitHub here.

Keep in mind that you will have to change lines 18 and 37 (the two `s3.Object(...)` calls) to specify your unique bucket name and JSON file name.

```
#!/usr/bin/env python3

import language_tool_python
import boto3
import json

# create the checker for American English
tool = language_tool_python.LanguageTool('en-US')

# set counts of variables to 0
num_mistakes_native = 0
word_count_native = 0
num_mistakes_learner = 0
word_count_learner = 0

# pulling files from s3 bucket for native speaker
s3 = boto3.resource('s3')
content_object = s3.Object('trans-native', 't_n_ital1.json')
file_content = content_object.get()['Body'].read().decode('utf-8')
json_content = json.loads(file_content)

# reading .json as string
text = json_content['TranslatedText']

# word count: split on any run of whitespace (spaces, newlines, tabs)
# and count the resulting tokens
words = text.split()
word_count_native = len(words)

# check the whole text in one call; tool.check returns one match
# per detected grammar or spelling issue
matches = tool.check(text)
num_mistakes_native = num_mistakes_native + len(matches)
    
# repeat process for the non-native speaker    
# pulling files from s3 bucket for non-native "learner" speaker
content_object = s3.Object('trans-learner', 't_l_ital1.json')
file_content = content_object.get()['Body'].read().decode('utf-8')
json_content = json.loads(file_content)

# reading .json as string
text = json_content['TranslatedText']

# word count: split on any run of whitespace (spaces, newlines, tabs)
# and count the resulting tokens
words = text.split()
word_count_learner = len(words)

# check the whole text in one call; tool.check returns one match
# per detected grammar or spelling issue
matches = tool.check(text)
num_mistakes_learner = num_mistakes_learner + len(matches)


print("The number of words in the native speaker document is", word_count_native)
print("The number of mistakes in the native speaker document is", num_mistakes_native)
print("The number of words in the non-native speaker document is", word_count_learner)
print("The number of mistakes in the non-native speaker document is", num_mistakes_learner)

print("For the native speaker the grammar mistake rate is", num_mistakes_native*100/word_count_native, "%")
print("For the non-native speaker the grammar mistake rate is", num_mistakes_learner*100/word_count_learner, "%")
```



Now in the shell, make the script executable by running the following command:
```
$ chmod u+x grammarmistake.py
```
And now let's execute the script. It may take a while to run depending on the length of the text.
```
$ python grammarmistake.py
```




### And here's the output using our data of a non-native Italian speaker and a native Italian speaker



```
>>> The number of words in the native speaker document is: 145
>>> The number of mistakes in native speaker document is: 13
>>> The number of words in the non-native speaker document is: 140
>>> The number of mistakes in the non-native speaker document is: 20
>>> For the native speaker the grammar mistake rate is: 8.96551724137931 %
>>> For the non-native speaker the grammar mistake rate is: 14.285714285714286 %
```
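The rates above are simply the number of mistakes divided by the number of words, expressed as a percentage. As a quick check of the arithmetic, here is a small sketch; `mistake_rate` is a hypothetical helper, and returning `0.0` for an empty text is an assumption made to avoid dividing by zero.

```python
def mistake_rate(num_mistakes, word_count):
    """Return mistakes per 100 words (hypothetical helper)."""
    if word_count == 0:
        return 0.0  # assumption: treat an empty text as having no measurable rate
    return num_mistakes * 100 / word_count


print(round(mistake_rate(13, 145), 2))  # 8.97  (native speaker)
print(round(mistake_rate(20, 140), 2))  # 14.29 (non-native speaker)
```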



We see that the mistake rate is higher for the non-native speaker. That is not surprising, although we had hypothesized otherwise. Because this is a single comparison of two short texts, no real conclusions can be drawn from it. However, we can employ another machine learning service, Amazon Comprehend, to get a closer look at the readability of these texts. The comparison can be repeated for many different samples (though I struggled to write a for loop that iterates over the files in a bucket, hence the need to adjust the script manually).
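One possible way to avoid editing the script for each file is to loop over every object in a bucket. The sketch below assumes the boto3 resource interface already used in the script (`s3.Bucket(name).objects.all()` to list objects and `s3.Object(bucket, key)` to read one). `translated_texts` is a hypothetical helper, not part of the original script, and it has not been run against the buckets in this walk-through.

```python
import json


def translated_texts(s3, bucket_name, suffix=".json"):
    """Yield (key, translated_text) for each JSON object in a bucket.

    `s3` is expected to be a boto3 S3 resource, e.g. boto3.resource('s3').
    """
    for summary in s3.Bucket(bucket_name).objects.all():
        if not summary.key.endswith(suffix):
            continue  # skip anything that is not a .json file
        obj = s3.Object(bucket_name, summary.key)
        body = obj.get()["Body"].read().decode("utf-8")
        yield summary.key, json.loads(body)["TranslatedText"]
```

Used with the script's own objects, this might look like `for key, text in translated_texts(boto3.resource('s3'), 'trans-native'):` followed by the word count and `tool.check(text)` steps for each `text`.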