## DS-UA 301: NLP & RL -- Assignment 1

Welcome! In this notebook, we will explore and practice several foundational concepts and techniques in Natural Language Processing, specifically in the "pre-processing" stage: 
+ Finding and getting a corpus up and running in your Jupyter notebook
+ Cleaning & tokenizing text data
+ POS tagging 
+ Stemming & lemmatization

We'll also *start to* think about approaching our NLP work in a hypothesis-driven way. This will help for the project, as well as life in general, as we've elaborated ad nauseam in lecture already (and, fair warning, will continue to do).

Some tips: 
+ You may use any data you like (built-in NLTK or something else!) as long as you are able to complete all the prompts below.
+ You're also welcome to try the assignment with both an NLTK corpus *and* something you collect yourself! No extra credit, just extra learning :).
+ We will be grading you for correctness and completeness (i.e., you must correctly complete the specific tasks we ask of you), but there is rarely a single "right answer" in much of data science, NLP very much included, when it comes to *which* strategy to use.
+ What this means for you is: for example, while we do need you to, e.g., remove all capital letters if we ask you to, for questions about your decision-making about whether or not removing capital letters is appropriate, we care more that you thoughtfully and transparently explain your reasoning, rather than worrying whether it's "right" to keep them or not. 
+ Of course, we want you to use your best judgement, but usually there isn't a single correct answer out there floating around for questions about what method you use. It's about balancing tradeoffs, and we want to see that you understand the tradeoffs.
+ In other words, to loosely quote a wise data scientist (not me, ha ha ha (but really it isn't)): **Keep the precision but let go of perfectionism!**
+ Also, 2-3 sentences (at most!) for the non-code questions should generally be enough. No need for essays!

Finally, each question is worth 1 whole point, for 45 points total.

## 1. Finding some text you'd like to represent as data

This step may be less trivial an exercise than it seems at first glance. As with structured, numeric data, finding data that's both usable *and* addresses the question you're interested in is often no small feat.

If you go the NLP route for the project, you'll eventually need to use your own data (alas, NLTK library data is not allowed for that). But for this assignment, especially if it's your first go at NLP, you're more than welcome to use something built in so you can focus on the techniques.

That said, it may make your life (eventually) easier if you start exploring for and with your own data now!

(a) What text will you be investigating in this assignment?

I'm going to scrape some sample blog posts from the website https://towardsdatascience.com/, a blog that mainly focuses on topics in data science.

For this homework, I'm going to use this sample article: https://towardsdatascience.com/10-highly-probable-data-scientist-interview-questions-fd83f7414760

(b) Why did you choose this text? What do you hope to learn from it? (Even if you're working with an NLTK corpus, surely you have a reason you were drawn to one!)

I'm personally interested in how data literacy can be taught through free online resources. By analyzing such a corpus, I hope to learn which topics are popular and, potentially, which techniques are popular to learn.

(c) Imagine, hypothetically, that you needed to form a hypothesis about this data (you won't have to test it in this assignment). What are three hypotheses you could test using this data? (They can be three related hypotheses, or not! You can also imagine that you would eventually add other data if needed, or not.)

H1: The word "python" appears relatively more frequently here than in other types of articles, for example news.

H2: There are more capitalized words (standing for proper nouns such as model names) here than in other types of articles.

H3: There are more digits and mathematical expressions here than in other types of articles.

(d) For *one* of the hypotheses above, how would you test it, and how would you know if your hypothesis is wrong? In other words, what sort of result would disprove it in the context of your study? (Again, you don't have to test anything in this assignment, but (obviously) this is practice for the project!)

Take the first hypothesis as an example. I will start by collecting a set of towardsdatascience blog posts and news articles, and compute the average frequency of the word "python" for each type of article, i.e. freq("python") = count("python") / len(document). If the hypothesis is true, we expect a higher average frequency of the word "python" in towardsdatascience blogs. Conversely, an average frequency that is equal or lower in the blogs would count against the hypothesis.

We could compute the frequency directly on the original texts or, as a potential improvement, first clean the text by removing punctuation, stop words, and so on, so that the frequency reflects how much of the meaningful content of each type of text the word "python" takes up.
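As a rough sketch of the comparison described above, the frequency formula could be computed as below. The function names, the tokenizing regex, and the toy documents are all assumptions made for illustration, and `len(document)` is interpreted here as the number of tokens rather than the number of characters.

```python
import re

def relative_frequency(document: str, word: str) -> float:
    """Return count(word) / number of tokens, ignoring case."""
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)

def mean_frequency(documents, word):
    """Average the per-document relative frequency over a collection."""
    freqs = [relative_frequency(doc, word) for doc in documents]
    return sum(freqs) / len(freqs) if freqs else 0.0

# Hypothetical toy documents, not real data
blog_docs = ["Python is popular. I use Python daily.",
             "SQL and Python are both useful."]
news_docs = ["The council met on Tuesday.",
             "A python escaped from the zoo today."]

blog_mean = mean_frequency(blog_docs, "python")
news_mean = mean_frequency(news_docs, "python")
print(blog_mean, news_mean)
```

With real data, a formal test (rather than a bare comparison of the two means) would be needed before drawing any conclusion about H1.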

(e) Ok, it's time to get it ready to go in Python. Import, load, or do whatever's necessary so your text is workable in one document here (you can have more than one document, but it's not needed for this assignment).

The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.Despite the high demand for data scientists, it is a highly challenging task to find your first job. Unless you have a solid prior job experience, interviews are where you can show you skills and impress your potential employer.Data science is an interdisciplinary field which covers a broad range of topics and concepts. Thus, the number of questions that you might be asked at an interview is very high.However, there are some questions about the fundamentals in data science and machine learning. These are the ones you do not want to miss. In this article, we will go over 10 questions that are likely to be asked at a data scientist interview.The questions are grouped into 3 main categories which are machine learning, Python, and SQL. I will try to provide a brief answer for each question. However, I suggest reading or studying each one in more detail afterwards.Overfitting in machine learning occurs when your model is not generalized well. The model is too focused on the training set. It captures a lot of detail or even noise in the training set. Thus, it fails to capture the general trend or the relationships in the data. If a model is too complex compared to the data, it will probably be overfitting.A strong indicator of overfitting is the high difference between the accuracy of training and test sets. Overfit models usually have very high accuracy on the training set but the test accuracy is usually unpredictable and much lower than the training accuracy.We can reduce overfitting by making the model more generalized which means it should be more focused on the general trend rather than specific details.If it is possible, collecting more data is an efficient way to reduce overfitting. You will be giving more juice to the model so it will have more material to learn from. 
Data is always valuable especially for machine learning models.Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.It does so by penalizing higher terms in the model. With the addition of a regularization term, the model tries to minimize both loss and complexity.Two main types of regularization are L1 and L2 regularization. L1 regularization subtracts a small amount from the weights of uninformative features at each iteration. Thus, it causes these weights to eventually become zero.On the other hand, L2 regularization removes a small percentage from the weights at each iteration. These weights will get closer to zero but never actually become 0.Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.For instance, spam email detection is a classification task. We provide a model with several emails marked as spam or not spam. After the model is trained with those emails, it will evaluate the new emails appropriately.Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible. An example of a clustering task would be grouping customers based on their shopping behavior.The built-in data structures are of crucial importance. 
Thus, you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are 4 main built-in data structures in Python.The main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.Let’s do an example to demonstrate the main difference between lists and sets.As we notice in the resulting objects, the list contains all the characters in the string whereas the set only contains unique values.Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.Here is a table that summarizes the main characteristics of lists, tuples, and sets.A dictionary in Python is a collection of key-value pairs. It is similar to a list in the sense that each item in a list has an associated index starting from 0.In a dictionary, we have keys as the index. Thus, we can access a value by using its key.The keys in a dictionary are unique which makes sense because they act like an address for the values.SQL is an extremely important skill for data scientists. There are quite a number of companies that store their data in a relational database. SQL is what is needed to interact with relational databases.You will probably be asked a question that involves writing a query to perform a specific task. You might also be asked a question about general database knowledge.Consider we have a sales table that contains daily sales quantities of products.Find the top 5 weeks in terms of total weekly sales quantities.We first extract the year and week information from the date column and then use it in the aggregation. 
The sum function is used to calculate the total sales quantities.In the same sales table, find the number of unique items that are sold each month.These terms are related to database schema design. Normalization and denormalization aim to optimize different metrics.The goal of normalization is to reduce data redundancy and inconsistency by increasing the number of tables. On the other hand, denormalization aims to speed up the query execution. Denormalization decreases the number of tables but at the same time, it adds some redundancy.It is a challenging task to become a data scientist. It requires time, effort, and dedication. Without having prior job experience, the process gets harder.Interviews are very important to demonstrate your skills. In this article, we have covered 10 questions that you are likely to encounter in a data scientist interview.Thank you for reading. Please let me know if you have any feedback.

(f) How many words are we working with here? (I.e., how many are in the corpus you'll be using in this assignment?)

There are 1165 words, according to the word count function in Microsoft Word.
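The count could also be computed in the notebook itself. The sketch below is one possible approach, with `count_words` as a hypothetical helper. It uses a regular expression rather than a plain whitespace split because the scraped text sometimes has no space after a period, which would make two words look like one.

```python
import re

def count_words(text: str) -> int:
    """Count runs of letters, digits, apostrophes, and hyphens as words."""
    return len(re.findall(r"[A-Za-z0-9'’-]+", text))

# Short excerpt from the article, including a missing space after a period
sample = "becoming a data scientist.Despite the high demand"

print(len(sample.split()))  # whitespace split treats "scientist.Despite" as one word
print(count_words(sample))  # the regex counts the two words separately
```

Running `count_words` on the full text may give a number that differs slightly from the Microsoft Word count, since the two tools define a "word" differently.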

## 2. Capitalization

(a) Hooray! You're ready to start pre-processing. First, consider capitalizations. Before we do anything with them, think about what you'd like to do with this data, and what sort of data it is. Do you think you'll want to keep the capitalizations, or change everything to lowercase? Or something else?

I will change everything to lowercase.

(b) Briefly explain your answer to 2a above.

Since we focus on analyzing the content rather than the text formatting used in the blog, changing the whole text to lowercase is a good and straightforward choice.

(c) For fun(!), regardless of your answer to 2a, write some code to remove all capitalizations. (If you want to keep the capitalization in later stages, just comment it out after you confirm it works, but leave the code.)

In [1]:
text = """The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.Despite the high demand for data scientists, it is a highly challenging task to find your first job. Unless you have a solid prior job experience, interviews are where you can show you skills and impress your potential employer.Data science is an interdisciplinary field which covers a broad range of topics and concepts. Thus, the number of questions that you might be asked at an interview is very high.However, there are some questions about the fundamentals in data science and machine learning. These are the ones you do not want to miss. In this article, we will go over 10 questions that are likely to be asked at a data scientist interview.The questions are grouped into 3 main categories which are machine learning, Python, and SQL. I will try to provide a brief answer for each question. However, I suggest reading or studying each one in more detail afterwards.Overfitting in machine learning occurs when your model is not generalized well. The model is too focused on the training set. It captures a lot of detail or even noise in the training set. Thus, it fails to capture the general trend or the relationships in the data. If a model is too complex compared to the data, it will probably be overfitting.A strong indicator of overfitting is the high difference between the accuracy of training and test sets. Overfit models usually have very high accuracy on the training set but the test accuracy is usually unpredictable and much lower than the training accuracy.We can reduce overfitting by making the model more generalized which means it should be more focused on the general trend rather than specific details.If it is possible, collecting more data is an efficient way to reduce overfitting. You will be giving more juice to the model so it will have more material to learn from. 
Data is always valuable especially for machine learning models.Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.It does so by penalizing higher terms in the model. With the addition of a regularization term, the model tries to minimize both loss and complexity.Two main types of regularization are L1 and L2 regularization. L1 regularization subtracts a small amount from the weights of uninformative features at each iteration. Thus, it causes these weights to eventually become zero.On the other hand, L2 regularization removes a small percentage from the weights at each iteration. These weights will get closer to zero but never actually become 0.Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.For instance, spam email detection is a classification task. We provide a model with several emails marked as spam or not spam. After the model is trained with those emails, it will evaluate the new emails appropriately.Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible. An example of a clustering task would be grouping customers based on their shopping behavior.The built-in data structures are of crucial importance. 
Thus, you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are 4 main built-in data structures in Python.The main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.Let’s do an example to demonstrate the main difference between lists and sets.As we notice in the resulting objects, the list contains all the characters in the string whereas the set only contains unique values.Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.Here is a table that summarizes the main characteristics of lists, tuples, and sets.A dictionary in Python is a collection of key-value pairs. It is similar to a list in the sense that each item in a list has an associated index starting from 0.In a dictionary, we have keys as the index. Thus, we can access a value by using its key.The keys in a dictionary are unique which makes sense because they act like an address for the values.SQL is an extremely important skill for data scientists. There are quite a number of companies that store their data in a relational database. SQL is what is needed to interact with relational databases.You will probably be asked a question that involves writing a query to perform a specific task. You might also be asked a question about general database knowledge.Consider we have a sales table that contains daily sales quantities of products.Find the top 5 weeks in terms of total weekly sales quantities.We first extract the year and week information from the date column and then use it in the aggregation. 
The sum function is used to calculate the total sales quantities.In the same sales table, find the number of unique items that are sold each month.These terms are related to database schema design. Normalization and denormalization aim to optimize different metrics.The goal of normalization is to reduce data redundancy and inconsistency by increasing the number of tables. On the other hand, denormalization aims to speed up the query execution. Denormalization decreases the number of tables but at the same time, it adds some redundancy.It is a challenging task to become a data scientist. It requires time, effort, and dedication. Without having prior job experience, the process gets harder.Interviews are very important to demonstrate your skills. In this article, we have covered 10 questions that you are likely to encounter in a data scientist interview.Thank you for reading. Please let me know if you have any feedback."""
text_lower = text.lower()

(d) For even more(!) fun, regardless of your answer to 2a, write some code to instead remove only capitalizations that appear at the beginning of each sentence. (Again, if this doesn't suit your overall analysis, just comment it out after you're done.)

Inspecting the following result, we can see that terms like "SQL" and "L1 regularization" keep their leading capital letter.

In [2]:
# Lowercase the first letter that follows a punctuation mark.
# Note: every character in string.punctuation counts, including commas.
import string

is_punct = True  # treat the start of the text like a sentence boundary
text_lowFirst = ""
for letter in text:
    if is_punct and letter in string.ascii_letters:
        text_lowFirst += letter.lower()
    else:
        text_lowFirst += letter
    if letter == " ":
        # a space keeps the current state, so ". The" is still handled
        continue
    is_punct = letter in string.punctuation
text_lowFirst

'the popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.despite the high demand for data scientists, it is a highly challenging task to find your first job. unless you have a solid prior job experience, interviews are where you can show you skills and impress your potential employer.data science is an interdisciplinary field which covers a broad range of topics and concepts. thus, the number of questions that you might be asked at an interview is very high.however, there are some questions about the fundamentals in data science and machine learning. these are the ones you do not want to miss. in this article, we will go over 10 questions that are likely to be asked at a data scientist interview.the questions are grouped into 3 main categories which are machine learning, python, and SQL. i will try to provide a brief answer for each question. however, i suggest reading or studying each o
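In the output above, words that follow a comma (such as "python" and "I") are lowercased as well, because the loop reacts to any punctuation mark. A stricter variant that only reacts to sentence-ending punctuation might look like the sketch below. The function name and the choice of `.`, `!`, and `?` as sentence boundaries are assumptions, and abbreviations such as "i.e." are not handled.

```python
import re

def lower_sentence_starts(text: str) -> str:
    """Lowercase the first letter of each sentence, leaving other capitals alone."""
    # A sentence start is the beginning of the text, or a capital letter that
    # follows '.', '!' or '?' and optional whitespace (including newlines).
    pattern = re.compile(r"(^|[.!?]\s*)([A-Z])")
    return pattern.sub(lambda m: m.group(1) + m.group(2).lower(), text)

sample = "Data science is broad.Despite this, Python and SQL help. I agree."
print(lower_sentence_starts(sample))
```

This variant still turns a sentence-initial "SQL" into "sQL", so it is only a partial answer to the question.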

(e) So far we've tried the capitalization options we discussed in lecture. Now, think up your own capitalization rule (anything at all is fine, though try to think of what might be most useful for your text and goals). What rule will you implement?

Lowercasing the first word of each sentence is a good capitalization rule. However, there are situations where it fails, for example: "SQL is a good tool to manage relational databases and execute various operations on data." We cannot simply lowercase the first letter to get "sQL".

There are plenty of capitalized proper nouns used in the field of data science. A potential solution is to maintain a dictionary of these nouns. However, these words rarely carry a different meaning when lowercased, so we could skip this step. Not to mention that it is time-consuming and labor-intensive to keep such a dictionary up to date, as new terms in data science have emerged quickly in recent years.

(f) Go on, implement it!

In [3]:
text = """The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.Despite the high demand for data scientists, it is a highly challenging task to find your first job. Unless you have a solid prior job experience, interviews are where you can show you skills and impress your potential employer.Data science is an interdisciplinary field which covers a broad range of topics and concepts. Thus, the number of questions that you might be asked at an interview is very high.However, there are some questions about the fundamentals in data science and machine learning. These are the ones you do not want to miss. In this article, we will go over 10 questions that are likely to be asked at a data scientist interview.The questions are grouped into 3 main categories which are machine learning, Python, and SQL. I will try to provide a brief answer for each question. However, I suggest reading or studying each one in more detail afterwards.Overfitting in machine learning occurs when your model is not generalized well. The model is too focused on the training set. It captures a lot of detail or even noise in the training set. Thus, it fails to capture the general trend or the relationships in the data. If a model is too complex compared to the data, it will probably be overfitting.A strong indicator of overfitting is the high difference between the accuracy of training and test sets. Overfit models usually have very high accuracy on the training set but the test accuracy is usually unpredictable and much lower than the training accuracy.We can reduce overfitting by making the model more generalized which means it should be more focused on the general trend rather than specific details.If it is possible, collecting more data is an efficient way to reduce overfitting. You will be giving more juice to the model so it will have more material to learn from. 
Data is always valuable especially for machine learning models.Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.It does so by penalizing higher terms in the model. With the addition of a regularization term, the model tries to minimize both loss and complexity.Two main types of regularization are L1 and L2 regularization. L1 regularization subtracts a small amount from the weights of uninformative features at each iteration. Thus, it causes these weights to eventually become zero.On the other hand, L2 regularization removes a small percentage from the weights at each iteration. These weights will get closer to zero but never actually become 0.Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.For instance, spam email detection is a classification task. We provide a model with several emails marked as spam or not spam. After the model is trained with those emails, it will evaluate the new emails appropriately.Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible. An example of a clustering task would be grouping customers based on their shopping behavior.The built-in data structures are of crucial importance. 
Thus, you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are 4 main built-in data structures in Python.The main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.Let’s do an example to demonstrate the main difference between lists and sets.As we notice in the resulting objects, the list contains all the characters in the string whereas the set only contains unique values.Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.Here is a table that summarizes the main characteristics of lists, tuples, and sets.A dictionary in Python is a collection of key-value pairs. It is similar to a list in the sense that each item in a list has an associated index starting from 0.In a dictionary, we have keys as the index. Thus, we can access a value by using its key.The keys in a dictionary are unique which makes sense because they act like an address for the values.SQL is an extremely important skill for data scientists. There are quite a number of companies that store their data in a relational database. SQL is what is needed to interact with relational databases.You will probably be asked a question that involves writing a query to perform a specific task. You might also be asked a question about general database knowledge.Consider we have a sales table that contains daily sales quantities of products.Find the top 5 weeks in terms of total weekly sales quantities.We first extract the year and week information from the date column and then use it in the aggregation. 
The sum function is used to calculate the total sales quantities.In the same sales table, find the number of unique items that are sold each month.These terms are related to database schema design. Normalization and denormalization aim to optimize different metrics.The goal of normalization is to reduce data redundancy and inconsistency by increasing the number of tables. On the other hand, denormalization aims to speed up the query execution. Denormalization decreases the number of tables but at the same time, it adds some redundancy.It is a challenging task to become a data scientist. It requires time, effort, and dedication. Without having prior job experience, the process gets harder.Interviews are very important to demonstrate your skills. In this article, we have covered 10 questions that you are likely to encounter in a data scientist interview.Thank you for reading. Please let me know if you have any feedback."""
text_lower = text.lower()
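For comparison, the dictionary-based alternative discussed in 2e could be sketched as follows. The `PROTECTED` mapping is a hypothetical, hand-picked list used only for illustration; a real one would need to be curated and maintained.

```python
import re

# Hypothetical mapping from lowercase form to standard form
PROTECTED = {"sql": "SQL", "l1": "L1", "l2": "L2", "python": "Python"}

def lower_except_protected(text: str, protected=PROTECTED) -> str:
    """Lowercase the text, then restore the standard form of protected terms."""
    lowered = text.lower()
    if not protected:
        return lowered
    # Match protected terms only as whole words
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, protected)) + r")\b")
    return pattern.sub(lambda m: protected[m.group(1)], lowered)

sample = ("Two main types of regularization are L1 and L2 regularization. "
          "SQL is what is needed. Thus, Python helps.")
print(lower_except_protected(sample))
```

The trade-off is the one noted in 2e: the result keeps proper nouns readable, at the cost of maintaining the list.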

(g) Now that you've explored some capitalization options, which one do you think is the "best" fit for your text and goals? (And here's where you can comment out anything you won't use going forward and then re-run what's left so it's in for the shape you want for the next questions.)

In this homework, I will simply lowercase the whole text.

(h) Why do you think the rule you chose in 2g is the most appropriate for your work? Give one strength and one weakness of the rule you've chosen.

Though it's the simplest method for dealing with capitalization, it is quick and serves my subsequent analysis well, since I mostly care about the topics covered in data science blogs.

The downside is exactly the one I described earlier about proper nouns used in the blog. If we lowercase the whole text, they will appear in lowercase in my results rather than in their standard, commonly seen forms.

## 3. Punctuation

(a) We're going to go through a similar process for punctuation as we did with capitalization, but don't worry, we'll pick up the pace a bit. First, do you think you'll want punctuation, or not, or something in between? Briefly explain your answer.

I will remove punctuation, since I'm not focusing on how punctuation is used in blog posts.

(b) Remove whatever punctuation you think is appropriate for your work. You may need to try a few versions. There's no need to show us anything but the final one in this case.

In [4]:
allowed_punct = ['\'','\"','%','+','-','<','=','>','^','*','/','(',')','[',']',':']
punct = string.punctuation
for i in allowed_punct:
    punct = punct.replace(i,"")

In [5]:
text_lower_nopunct = ""
for letter in text_lower:
    if letter in punct:
        text_lower_nopunct += " "
    else:
        text_lower_nopunct += letter
text_lower_nopunct

'the popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist despite the high demand for data scientists  it is a highly challenging task to find your first job  unless you have a solid prior job experience  interviews are where you can show you skills and impress your potential employer data science is an interdisciplinary field which covers a broad range of topics and concepts  thus  the number of questions that you might be asked at an interview is very high however  there are some questions about the fundamentals in data science and machine learning  these are the ones you do not want to miss  in this article  we will go over 10 questions that are likely to be asked at a data scientist interview the questions are grouped into 3 main categories which are machine learning  python  and sql  i will try to provide a brief answer for each question  however  i suggest reading or studying each o

(c) What punctuation rule did you land on? Why did you decide on this one? Briefly walk us through the options you considered, if any. (If it was crystal clear to you what you wanted to do with the punctuation, just explain why.)

I remove the common punctuation marks that separate sentences and clauses, but keep the ones used in mathematical expressions or to introduce an explanation.

This is a simple and fast approach that prepares the text well for my later analysis. Educational data science blog posts can contain plenty of mathematical expressions, and this way I can keep them intact.
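
The rule above can also be sketched with `str.translate`, which avoids building the string one character at a time. This is a minimal sketch, not the notebook's own code: the function name `replace_punct_with_space` and the sample sentence are made up for illustration, and the kept characters are copied from `allowed_punct`.

```python
import string

# Punctuation kept because it can appear in mathematical expressions or explanations
# (same characters as the allowed_punct list in the notebook).
ALLOWED_PUNCT = set('\'"%+-<=>^*/()[]:')

def replace_punct_with_space(text, allowed=ALLOWED_PUNCT):
    """Replace every punctuation character not in `allowed` with a space."""
    to_drop = [ch for ch in string.punctuation if ch not in allowed]
    table = str.maketrans({ch: ' ' for ch in to_drop})
    return text.translate(table)

# Hypothetical sample sentence: the formula survives, the comma, semicolon and '!' do not
sample = "thus, accuracy = (tp + tn) / total; it's simple!"
print(replace_punct_with_space(sample))
```

Replacing with a space rather than deleting keeps words on either side of a dropped mark from being glued together.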

(d) This isn't *exactly* punctuation, but let's do it here: Go ahead and remove any other miscellaneous text, symbols, or tags, if there are any, that you don't need. If there aren't any, just say so! (You don't even have to explain why you're doing it (for once!). We *get* it.)

In [6]:
not_wanted_phrase = ['.com', 'towardsdatascience']
for ph in not_wanted_phrase:
    text_lower_nopunct = text_lower_nopunct.replace(ph,'')

## 4. Stop words

(a) It's time to talk stop words! Before we get into which stop word list to use or what other stop words there might be for your text, share what you think will be right for you: drop most "standard" stop words, keep them, add some stop words of your own, or some combination? Briefly explain your reasoning.

I can drop most "standard" stop words here, because the posts are usually written in standard English with relatively formal grammar and phrases.

(b) It's sometimes a little tricky to tell where the most useful line is between the benefit of simplification and the loss of substance when it comes to stop words in a particular text. Explore their usefulness for your text by first dropping all stop words from the NLTK stop word list.

Tokens without Stopwords: 603


[nltk_data] Downloading package punkt to /Users/yuxia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/yuxia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['popularity',
 'data',
 'science',
 'attracts',
 'lot',
 'people',
 'wide',
 'range',
 'professions',
 'make',
 'career',
 'change',
 'goal',
 'becoming',
 'data',
 'scientist',
 'despite',
 'high',
 'demand',
 'data',
 'scientists',
 'highly',
 'challenging',
 'task',
 'find',
 'first',
 'job',
 'unless',
 'solid',
 'prior',
 'job',
 'experience',
 'interviews',
 'show',
 'skills',
 'impress',
 'potential',
 'employer',
 'data',
 'science',
 'interdisciplinary',
 'field',
 'covers',
 'broad',
 'range',
 'topics',
 'concepts',
 'thus',
 'number',
 'questions',
 'might',
 'asked',
 'interview',
 'high',
 'however',
 'questions',
 'fundamentals',
 'data',
 'science',
 'machine',
 'learning',
 'ones',
 'want',
 'miss',
 'article',
 'go',
 '10',
 'questions',
 'likely',
 'asked',
 'data',
 'scientist',
 'interview',
 'questions',
 'grouped',
 '3',
 'main',
 'categories',
 'machine',
 'learning',
 'python',
 'sql',
 'try',
 'provide',
 'brief',
 'answer',
 'question',
 'however',
 'suggest

(c) How do you think the NLTK stop words did? Do you think removing them is generally helpful for your analysis? Why or why not? 

(By the way, I know you're not doing a *full* analysis, so it's ok if at this point you're thinking -- hang on, well, if eventually I want to do X, I wouldn't need stop words, but if I wanted to do Y, I probably would -- feel free to tell us that, or just pick a general direction from your hypotheses above and steer towards that. 

I *also* know we haven't gotten far in terms of techniques so it may not be obvious what you even *could* eventually do. But trust your curiosity and instincts! It's ok if it's not something you ultimately end up doing, or even is feasible long term. You can also be quite general, like "understand trends in X over time" and leave it more or less as that.)

Besides the NLTK stop words, I also added a few words that are common in this kind of blog: "towardsdatascience" (the column name) and ".com", which authors tend to include in their contact details at the very end.

After removing all stop words, the token count of the post drops by over 40%, from 1165 to 603.
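
The size of that drop can be checked with a one-line calculation. A minimal sketch, using only the two counts reported in the answer above; the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before, after):
    """Share of tokens removed, as a percentage of the original count."""
    return 100 * (before - after) / before

# Counts reported in the answer: 1165 tokens before, 603 after stop word removal
print(f'{percent_reduction(1165, 603):.1f}% of tokens were removed')
```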

(d) Put the NLTK stop words back in, and try removing stop words from a *different* list. It doesn't matter what list it is as long as it's not from NLTK or your own brain (yet).

In [9]:
len(tokens)

1170

In [10]:
stopwords_custom = ["towardsdatascience",".com","is","are","will","would","wouldn't","can","could","couldn't","shall","should", "shouldn't","have","having","going","aren't","am","isn't","it","they","he","she","him","her","them","you","your","what","when","who","where","how"]
tokens_no_sw_custom = []
for token in tokens:
    if token not in stopwords_custom:
        tokens_no_sw_custom.append(token)
print(f'Tokens without Stopwords: {len(tokens_no_sw_custom)}')
tokens_no_sw_custom

Tokens without Stopwords: 1038


['the',
 'popularity',
 'of',
 'data',
 'science',
 'attracts',
 'a',
 'lot',
 'of',
 'people',
 'from',
 'a',
 'wide',
 'range',
 'of',
 'professions',
 'to',
 'make',
 'a',
 'career',
 'change',
 'with',
 'the',
 'goal',
 'of',
 'becoming',
 'a',
 'data',
 'scientist',
 'despite',
 'the',
 'high',
 'demand',
 'for',
 'data',
 'scientists',
 'a',
 'highly',
 'challenging',
 'task',
 'to',
 'find',
 'first',
 'job',
 'unless',
 'a',
 'solid',
 'prior',
 'job',
 'experience',
 'interviews',
 'show',
 'skills',
 'and',
 'impress',
 'potential',
 'employer',
 'data',
 'science',
 'an',
 'interdisciplinary',
 'field',
 'which',
 'covers',
 'a',
 'broad',
 'range',
 'of',
 'topics',
 'and',
 'concepts',
 'thus',
 'the',
 'number',
 'of',
 'questions',
 'that',
 'might',
 'be',
 'asked',
 'at',
 'an',
 'interview',
 'very',
 'high',
 'however',
 'there',
 'some',
 'questions',
 'about',
 'the',
 'fundamentals',
 'in',
 'data',
 'science',
 'and',
 'machine',
 'learning',
 'these',
 'the',
 '

(e) What stop word list did you use, and why did you choose it?

In [11]:
check = ", ".join(stopwords_custom)
check

"towardsdatascience, .com, is, are, will, would, wouldn't, can, could, couldn't, shall, should, shouldn't, have, having, going, aren't, am, isn't, it, they, he, she, him, her, them, you, your, what, when, who, where, how"

The customized word list is shown above. It consists mostly of verbs, pronouns, and other words that carry no concrete meaning but make sentences grammatically correct and understandable.

(f) How do you think it did? Do you think it's more useful than the NLTK list, or not? Why or why not?

It does not perform as well as the NLTK list: it is a short list that I came up with myself, and it leaves 1038 tokens in place, compared with 603 for the NLTK list.

A good strategy for later use is to start from the NLTK stop word list and adjust it slightly according to the needs of the task.
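
That "start from an existing list, then adjust" strategy might look like the sketch below. To keep it self-contained, a short hand-written list stands in for the NLTK list; the helper `build_stopwords`, the words being kept, and the demo tokens are all assumptions for illustration.

```python
# Stand-in for an existing stop word list (ASSUMPTION: in the notebook this would
# be nltk.corpus.stopwords.words('english'); a short list keeps the sketch self-contained).
base_stopwords = ['the', 'of', 'a', 'is', 'are', 'to', 'not', 'no']

def build_stopwords(base, add=(), keep=()):
    """Return a set: the base list plus `add`, minus any words in `keep`."""
    return (set(base) | set(add)) - set(keep)

# Hypothetical adjustments: drop site-specific words, keep negations
stop_set = build_stopwords(base_stopwords,
                           add=['towardsdatascience', '.com'],
                           keep=['not', 'no'])

tokens_demo = ['the', 'model', 'is', 'not', 'overfitting', 'towardsdatascience']
print([t for t in tokens_demo if t not in stop_set])
# ['model', 'not', 'overfitting']
```

Using a set rather than a list also makes each `token not in stop_set` check faster on longer texts.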

(g) And now the moment you've all been waiting for -- it's time to think about some potentially useful stop words to explore dropping that might be unique to your text. First, what might some of these stop words be? Feel free to write some code below to work out what might be helpful (no need if that's not necessary). (Note: your final decision might be to remove no stop words, not even your own, but for now you must choose at least one!)

In [12]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('.com')
stopwords.append('towardsdatascience')

(h) Go ahead and drop your unique stop words! (You may do so while retaining previous dropped stop words, or not, depending on what is more useful for you.)

In [13]:
# remove stop words
tokens_no_sw_nltk_custom = []
for token in tokens:
    if token not in stopwords:
        tokens_no_sw_nltk_custom.append(token)
print(f'Tokens without Stopwords: {len(tokens_no_sw_nltk_custom)}')
tokens_no_sw_nltk_custom

Tokens without Stopwords: 603


['popularity',
 'data',
 'science',
 'attracts',
 'lot',
 'people',
 'wide',
 'range',
 'professions',
 'make',
 'career',
 'change',
 'goal',
 'becoming',
 'data',
 'scientist',
 'despite',
 'high',
 'demand',
 'data',
 'scientists',
 'highly',
 'challenging',
 'task',
 'find',
 'first',
 'job',
 'unless',
 'solid',
 'prior',
 'job',
 'experience',
 'interviews',
 'show',
 'skills',
 'impress',
 'potential',
 'employer',
 'data',
 'science',
 'interdisciplinary',
 'field',
 'covers',
 'broad',
 'range',
 'topics',
 'concepts',
 'thus',
 'number',
 'questions',
 'might',
 'asked',
 'interview',
 'high',
 'however',
 'questions',
 'fundamentals',
 'data',
 'science',
 'machine',
 'learning',
 'ones',
 'want',
 'miss',
 'article',
 'go',
 '10',
 'questions',
 'likely',
 'asked',
 'data',
 'scientist',
 'interview',
 'questions',
 'grouped',
 '3',
 'main',
 'categories',
 'machine',
 'learning',
 'python',
 'sql',
 'try',
 'provide',
 'brief',
 'answer',
 'question',
 'however',
 'suggest

(i) Having explored a few angles on stop words, what version do you think is best? (You could also choose, e.g., a subset of an existing list, or a subset + a few unique ones -- anything. If you do something outside of what we've already done anywhere in Question 4, just include your code below.) Briefly describe your stop word strategy and why you think it's the most useful. (As before, you may comment out any stop word code that you ultimately won't want to use going forward.)

The last version is the best: I add a few words to the NLTK stop word list to form a customized list.

This way, NLTK still handles most English stop words, and the list also covers words that often appear in towardsdatascience posts but give us little information when analyzing the text.

## 5. Tokenize this!

(a) Go ahead and tokenize your text!

I tokenized the text before dealing with stop words.

(b) How many *tokens* are in your corpus?

In [14]:
print(f'there are {len(tokens_no_sw_nltk_custom)} tokens in my corpus')

there are 603 tokens in my corpus


(c) How many *types* are in your corpus?

In [15]:
myset = set(tokens_no_sw_nltk_custom)
print(f'There are {len(myset)} types in my corpus')

There are 345 types in my corpus


(d) How many *terms* are in your corpus? How did you come to this number?

In [16]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# collect the distinct Porter stems ("terms") of the remaining tokens
myset_stem_por = set()
for token in tokens_no_sw_nltk_custom:
    myset_stem_por.add(ps.stem(token))
print(f'There are {len(myset_stem_por)} terms in my corpus.')

There are 288 terms in my corpus.


I stem every token to its root with the Porter stemmer and count the distinct stems.
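
The three counts from (b) to (d) can be put side by side in one small helper. This is a sketch under stated assumptions: `corpus_counts` and `strip_plural_s` are made-up names, and the plural stripper is only a crude stand-in so the example runs without NLTK; in the notebook, `ps.stem` would be passed instead.

```python
def corpus_counts(tokens, stem):
    """Return (number of tokens, number of types, number of terms).

    tokens: every token occurrence; types: distinct tokens;
    terms: distinct tokens after normalising with `stem`.
    """
    types = set(tokens)
    terms = {stem(t) for t in types}
    return len(tokens), len(types), len(terms)

def strip_plural_s(token):
    # ASSUMPTION: a crude stand-in for a real stemmer such as PorterStemmer().stem
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

demo = ['data', 'scientist', 'data', 'scientists', 'questions', 'question']
print(corpus_counts(demo, strip_plural_s))
# (6, 5, 3)
```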

(e) Is there anything else you'd like to share about your tokenizing experience? It's ok if the answer is "no", as tokenizing is (probably) the least controversial of all the steps. (Having said that, I'm sure the tokenization wars on Twitter will now blow up in my face!)

Nope.

One observation: I first tried to remove stop words before tokenizing, using string.replace(stopword, ' '). That made a mess, because a stop word can also occur inside another word, for example the stop word "he" inside "theater". So I chose to tokenize the text first and then drop stop words.
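
The problem described above is easy to reproduce on a single sentence. A minimal sketch with a made-up sentence and one stop word:

```python
text = 'he went to the theater'
stop_word = 'he'

# Substring replacement: 'he' is also cut out of 'the' and 'theater'
mangled = text.replace(stop_word, ' ')
print(repr(mangled))

# Token-level filtering: only whole tokens equal to the stop word are dropped
filtered = [token for token in text.split() if token != stop_word]
print(filtered)
# ['went', 'to', 'the', 'theater']
```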

## 5. POS tagging

(a) In lecture we talked about several strategies for POS tagging. Two of them were *lexical-based* and *rule-based*. Tag your corpus according to a *lexical-based* strategy.

In [17]:
nltk.download('averaged_perceptron_tagger')
# lexical-based POS tagging
nltk.pos_tag(tokens_no_sw_nltk_custom)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/yuxia/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('popularity', 'NN'),
 ('data', 'NNS'),
 ('science', 'NN'),
 ('attracts', 'VBZ'),
 ('lot', 'JJ'),
 ('people', 'NNS'),
 ('wide', 'JJ'),
 ('range', 'VBP'),
 ('professions', 'NNS'),
 ('make', 'VBP'),
 ('career', 'NN'),
 ('change', 'NN'),
 ('goal', 'NN'),
 ('becoming', 'VBG'),
 ('data', 'NNS'),
 ('scientist', 'NN'),
 ('despite', 'IN'),
 ('high', 'JJ'),
 ('demand', 'NN'),
 ('data', 'NNS'),
 ('scientists', 'NNS'),
 ('highly', 'RB'),
 ('challenging', 'VBG'),
 ('task', 'NN'),
 ('find', 'VBP'),
 ('first', 'JJ'),
 ('job', 'NN'),
 ('unless', 'IN'),
 ('solid', 'JJ'),
 ('prior', 'JJ'),
 ('job', 'NN'),
 ('experience', 'NN'),
 ('interviews', 'NNS'),
 ('show', 'VBP'),
 ('skills', 'VBZ'),
 ('impress', 'JJ'),
 ('potential', 'JJ'),
 ('employer', 'NN'),
 ('data', 'NNS'),
 ('science', 'NN'),
 ('interdisciplinary', 'JJ'),
 ('field', 'NN'),
 ('covers', 'VBZ'),
 ('broad', 'JJ'),
 ('range', 'NN'),
 ('topics', 'NNS'),
 ('concepts', 'NNS'),
 ('thus', 'RB'),
 ('number', 'NN'),
 ('questions', 'NNS'),
 ('might', '

(b) Now, remove those tags and instead tag your corpus according to a *rule-based* strategy.

In [22]:
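# rule-based tagging (if-then statements)

tags = []
for i, token in enumerate(tokens_no_sw_nltk_custom):
    tag = 'UNK'
    # Rule 1: 'data' is tagged as a plural noun
    if token == 'data':
        tag = 'NNS'
    # Rule 2: 'science' right after a plural noun is a singular noun
    # (the i > 0 guard avoids indexing an empty list on the first token)
    if token == 'science' and i > 0 and tags[i-1] == 'NNS':
        tag = 'NN'
    # Rule 3: whatever follows a singular noun is tagged as a verb
    if len(tags) > 1 and tags[i-1] == 'NN':
        tag = 'VB'
    tags.append(tag)

for token, tag in zip(tokens_no_sw_nltk_custom, tags):
    print(token, tag)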

popularity UNK
data NNS
science NN
attracts VB
lot UNK
people UNK
wide UNK
range UNK
professions UNK
make UNK
career UNK
change UNK
goal UNK
becoming UNK
data NNS
scientist UNK
despite UNK
high UNK
demand UNK
data NNS
scientists UNK
highly UNK
challenging UNK
task UNK
find UNK
first UNK
job UNK
unless UNK
solid UNK
prior UNK
job UNK
experience UNK
interviews UNK
show UNK
skills UNK
impress UNK
potential UNK
employer UNK
data NNS
science NN
interdisciplinary VB
field UNK
covers UNK
broad UNK
range UNK
topics UNK
concepts UNK
thus UNK
number UNK
questions UNK
might UNK
asked UNK
interview UNK
high UNK
however UNK
questions UNK
fundamentals UNK
data NNS
science NN
machine VB
learning UNK
ones UNK
want UNK
miss UNK
article UNK
go UNK
10 UNK
questions UNK
likely UNK
asked UNK
data NNS
scientist UNK
interview UNK
questions UNK
grouped UNK
3 UNK
main UNK
categories UNK
machine UNK
learning UNK
python UNK
sql UNK
try UNK
provide UNK
brief UNK
answer UNK
question UNK
however UNK
suggest UNK
rea

(c) Which strategy do you think is more useful for your analysis, and why? (As ever, you may comment out the one you don't choose.)

I prefer lexical-based tagging because it scales much better to large texts, especially when we don't know in advance what the text looks like.
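
The two strategies can also be combined: look a token up first, and fall back to rules only when the lookup fails. The sketch below is illustrative only. The tiny `LEXICON`, the suffix rules, and the function `tag_token` are assumptions, not part of the notebook or of NLTK.

```python
import re

# ASSUMPTION: a tiny hand-made lexicon and suffix rules, for illustration only
LEXICON = {'data': 'NNS', 'science': 'NN', 'high': 'JJ'}
SUFFIX_RULES = [
    (re.compile(r'.*ing$'), 'VBG'),  # gerunds: 'becoming'
    (re.compile(r'.*ly$'), 'RB'),    # adverbs: 'highly'
    (re.compile(r'.*s$'), 'NNS'),    # plural nouns: 'questions'
]

def tag_token(token):
    """Look the token up first; fall back to suffix rules, then 'UNK'."""
    if token in LEXICON:
        return LEXICON[token]
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(token):
            return tag
    return 'UNK'

demo = ['data', 'science', 'becoming', 'highly', 'questions', 'job']
print([(t, tag_token(t)) for t in demo])
```

Without knowing the text in advance, the rules stay generic (suffixes), which is why so many tokens would still end up as `UNK`.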

## 6. Let's stem!!

(a) In lecture we discussed two stemmers: Porter and Lancaster. Apply the Porter stemmer to your text.

In [23]:
#from nltk.stem import PorterStemmer
ps = PorterStemmer()
for token in tokens_no_sw_nltk_custom:
    root = ps.stem(token)
    print([token, root])

['popularity', 'popular']
['data', 'data']
['science', 'scienc']
['attracts', 'attract']
['lot', 'lot']
['people', 'peopl']
['wide', 'wide']
['range', 'rang']
['professions', 'profess']
['make', 'make']
['career', 'career']
['change', 'chang']
['goal', 'goal']
['becoming', 'becom']
['data', 'data']
['scientist', 'scientist']
['despite', 'despit']
['high', 'high']
['demand', 'demand']
['data', 'data']
['scientists', 'scientist']
['highly', 'highli']
['challenging', 'challeng']
['task', 'task']
['find', 'find']
['first', 'first']
['job', 'job']
['unless', 'unless']
['solid', 'solid']
['prior', 'prior']
['job', 'job']
['experience', 'experi']
['interviews', 'interview']
['show', 'show']
['skills', 'skill']
['impress', 'impress']
['potential', 'potenti']
['employer', 'employ']
['data', 'data']
['science', 'scienc']
['interdisciplinary', 'interdisciplinari']
['field', 'field']
['covers', 'cover']
['broad', 'broad']
['range', 'rang']
['topics', 'topic']
['concepts', 'concept']
['thus', 'thu'

(b) How did it do? Do you think the Porter stemmer is useful for your work? Why or why not? 

With Porter stemming, the number of distinct values decreases from 345 types to 288 terms. Tokens are aggregated into fewer groups, which makes analysis faster.

(c) Undo the Porter stemmer and instead apply the Lancaster stemmer to your text.

In [24]:
from nltk.stem.lancaster import LancasterStemmer
ls = LancasterStemmer()

myset_stem_lan = set()
for token in tokens_no_sw_nltk_custom:
    root = ls.stem(token)
    myset_stem_lan.add(root)
    print([token, root])

print(f'There are {len(myset_stem_lan)} terms in my corpus.')

['popularity', 'popul']
['data', 'dat']
['science', 'sci']
['attracts', 'attract']
['lot', 'lot']
['people', 'peopl']
['wide', 'wid']
['range', 'rang']
['professions', 'profess']
['make', 'mak']
['career', 'car']
['change', 'chang']
['goal', 'goal']
['becoming', 'becom']
['data', 'dat']
['scientist', 'sci']
['despite', 'despit']
['high', 'high']
['demand', 'demand']
['data', 'dat']
['scientists', 'sci']
['highly', 'high']
['challenging', 'challeng']
['task', 'task']
['find', 'find']
['first', 'first']
['job', 'job']
['unless', 'unless']
['solid', 'solid']
['prior', 'pri']
['job', 'job']
['experience', 'expery']
['interviews', 'interview']
['show', 'show']
['skills', 'skil']
['impress', 'impress']
['potential', 'pot']
['employer', 'employ']
['data', 'dat']
['science', 'sci']
['interdisciplinary', 'interdisciplin']
['field', 'field']
['covers', 'cov']
['broad', 'broad']
['range', 'rang']
['topics', 'top']
['concepts', 'conceiv']
['thus', 'thu']
['number', 'numb']
['questions', 'quest']
[

(d) How did the Lancaster stemmer do? Do you think it's more or less useful for your work than the Porter stemmer? Briefly explain your reasoning.

The Lancaster stemmer reduces the number of unique stems to 274.

In [25]:
print(f'Stems produced by the Porter stemmer but not by the Lancaster stemmer: \n{myset_stem_por - myset_stem_lan}')
print(f'Stems produced by the Lancaster stemmer but not by the Porter stemmer: \n{myset_stem_lan - myset_stem_por}')

Stems produced by the Porter stemmer but not by the Lancaster stemmer: 
{'prior', 'experi', 'base', 'type', 'date', 'harder', 'regular', 'one', 'place', 'manipul', 'time', 'materi', 'skill', 'normal', 'unsupervis', 'use', 'cover', 'daili', 'behavior', 'concept', 'never', 'possibl', 'scientist', 'gener', 'demonstr', 'notic', 'iter', 'indic', 'career', 'element', 'wide', 'like', 'compar', 'case', 'relat', 'number', 'lower', 'question', 'product', 'categori', 'uniqu', 'relationship', 'speed', 'fundament', 'especi', 'even', 'sever', 'captur', 'interdisciplinari', 'answer', 'rather', 'effici', 'mention', 'associ', 'make', 'structur', 'metric', 'topic', 'summar', 'scienc', 'data', 'key-valu', 'mutabl', 'actual', 'dedic', 'characterist', 'given', 'highli', 'closer', 'classif', 'function', 'give', 'total', 'penal', 'necessari', 'potenti', 'differ', 'queri', 'compani', 'locat', 'add', 'crucial', 'valuabl', 'cluster', 'articl', 'probabl', 'di

I personally prefer the Porter stemmer's results because they are more readable to me. For example, "fundamental" becomes "fundament" with the Porter stemmer but "funda" with the Lancaster stemmer.

## 7. Last but (well, maybe) not least ... Lemmatization!

(a) Go on, Lemmatize! You may do so in any way you see fit.

In [26]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

for token in tokens_no_sw_nltk_custom:
    print(f'[{token}, {wordnet_lemmatizer.lemmatize(token)}]')

[nltk_data] Downloading package wordnet to /Users/yuxia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[popularity, popularity]
[data, data]
[science, science]
[attracts, attracts]
[lot, lot]
[people, people]
[wide, wide]
[range, range]
[professions, profession]
[make, make]
[career, career]
[change, change]
[goal, goal]
[becoming, becoming]
[data, data]
[scientist, scientist]
[despite, despite]
[high, high]
[demand, demand]
[data, data]
[scientists, scientist]
[highly, highly]
[challenging, challenging]
[task, task]
[find, find]
[first, first]
[job, job]
[unless, unless]
[solid, solid]
[prior, prior]
[job, job]
[experience, experience]
[interviews, interview]
[show, show]
[skills, skill]
[impress, impress]
[potential, potential]
[employer, employer]
[data, data]
[science, science]
[interdisciplinary, interdisciplinary]
[field, field]
[covers, cover]
[broad, broad]
[range, range]
[topics, topic]
[concepts, concept]
[thus, thus]
[number, number]
[questions, question]
[might, might]
[asked, asked]
[interview, interview]
[high, high]
[however, however]
[questions, question]
[fundamentals, 

(b) Why did you choose the Lemmatizer that you did? (Note: "It seemed like the easiest one" is a fine answer for this assignment!)

It is the most basic lemmatizer to use, and it is the only lemmatizer class I found in the NLTK documentation.

(c) What do you think is more useful for your work, the winning stemmer from Question 6, or the Lemmatizer? Briefly explain your reasoning. In doing so, please comment on the tradeoffs between the two choices and why you landed where you did.

I prefer the lemmatizer, since it not only finds the root but also returns it as a complete word rather than a truncated stem. This makes the downstream analysis easier to understand and visualize.
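
One caveat: the output in (a) leaves verb forms such as "attracts" and "becoming" unchanged, which is consistent with `WordNetLemmatizer.lemmatize` treating every token as a noun unless a `pos` argument is given. A possible refinement, sketched under the assumption that the Penn Treebank tags from `nltk.pos_tag` are available, is to map each tag to a WordNet part-of-speech code first. The helper name `penn_to_wordnet` is made up for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default to noun, which is also lemmatize()'s default

# Usage with the objects defined in the notebook (not run here, needs the WordNet data):
# for token, tag in nltk.pos_tag(tokens_no_sw_nltk_custom):
#     print(token, wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)))

print(penn_to_wordnet('VBZ'), penn_to_wordnet('NNS'), penn_to_wordnet('JJ'))
# v n a
```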

## 8. Summing up

(a) Hopefully you've had a chance to experience, without getting *too* irritated at me for a few touches of tedium in this assignment, that even in this *relatively* uncontroversial pre-processing stage, you had to make a number of choices about how to clean and standardize your text. 

Overall, which choice was the most difficult to make, and which was the easiest, or most "obvious" for you? (Even if none were particularly difficult or particularly easy -- all possible depending on your text -- try to pull out at least the extremes!) Briefly explain your answer.

I found capitalization the most challenging choice. Capitalization can carry information, but it also creates multiple forms of the same word, which makes analysis harder. Finding the balance between lowercasing the whole text and keeping all capitalization is therefore not easy. There is no universal rule for capitalization; each decision has to be made for the specific type of corpus.

The way I dealt with punctuation was pretty straightforward: I directly drop the marks that separate sentences, because they do not convey useful information about the blog content.

(b) How confident are you in the choices you've made? In other words, if you were to proceed with analyzing this text, how likely do you think it is that you'd eventually want to go back and tweak (or totally change) some of the decisions you've made? What do you think you'd be most likely to change? Briefly explain your reasoning. If you expect to be 100 percent confident in all your choices in this assignment until the end of time, explain why.

If I collect more sample texts, I may experiment to find the best practice for stemming or lemmatizing. The point of reducing tokens to word roots is a smoother, more efficient, and more reasonable analysis later, so prior knowledge of how words are typically used in data science blogs would help me decide which stemmer or lemmatizer to use.

However, I inspected only one article for this homework, so I could observe only limited information about word usage. My choice of the lemmatizer is therefore provisional and open to change.

(c) We haven't learned much (yet!!) about what to actually *do* with text once it's pre-processed, but now that you have it, what do *you* imagine the next step would be in your analysis in order to test (or at least get closer to being ready to test) one or all of the hypotheses you identified in Question 1? Go with your instincts -- I bet there's a technique for it (and if there isn't, well, now you're a future NLP methods developer!).

After getting the list of tokens, I imagine the next step is to vectorize them, so that the computer can work with numbers and potentially find relationships between tokens using machine learning techniques.
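
One simple form this vectorizing step could take is a bag-of-words count vector. This is only a sketch of one possible technique, not something covered in the assignment; the two mini-documents and the helper `bag_of_words` are hypothetical.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical mini-documents, already tokenized
doc_a = ['data', 'science', 'interview', 'data']
doc_b = ['machine', 'learning', 'interview']

# A shared, sorted vocabulary gives every document a vector of the same length
vocabulary = sorted(set(doc_a) | set(doc_b))
print(vocabulary)
print(bag_of_words(doc_a, vocabulary))
print(bag_of_words(doc_b, vocabulary))
```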

(d) Finally, while, again, we haven't *really* done any analysis yet, what's something you've learned about your text from this pre-processing work? No lesson is too large or too small.

This homework showed me the complex side of language. I had never clearly realized how many word variations, non-letter characters, and grammar rules a language has. This is a challenge when we want to distill a standardized system that a computer can understand. We humans can learn a new language through ongoing exposure to it, but a computer doesn't know how to easily build a readable connection between words.

# The end! 

Congratulations on finishing your first assignment, and your first NLP work ever (for most of you, at least)! Don't forget to comment out (not delete!) the code that you decided is not appropriate for your work. In other words, leave the final version of this notebook such that we can run *your* analysis top to bottom.