# 1. Software Engineering Practices


This lesson uses classroom workspaces that contain all of the files and functionality you will need. You can also find the files in the data scientist nanodegree term 2 GitHub repo.
https://github.com/udacity/DSND_Term2/tree/master/lessons/ObjectOrientedProgramming


PRODUCTION CODE: software running on production servers to handle live users and data of the intended audience. Note this is different from production quality code, which describes code that meets expectations in reliability, efficiency, etc., for production. Ideally, all code in production meets these expectations, but this is not always the case.

CLEAN: readable, simple, and concise. A characteristic of production quality code that is crucial for collaboration and maintainability in software development.

MODULAR: logically broken up into functions and modules. Also an important characteristic of production quality code that makes your code more organized, efficient, and reusable. 
- DRY (don't repeat yourself).
- Abstraction
- Minimize the number of entities (functions, classes, modules)
- Functions should do 1 thing
- Arbitrary variable names can be more effective in certain functions (`for i in list...`)



MODULE: a file. Modules allow code to be reused by encapsulating it in files that can be imported into other files.


## Refactoring Code

Refactoring means restructuring your code to improve its internal structure without changing its external functionality. This gives you a chance to clean and modularize your program after you've got it working.


Since it isn't easy to write your best code while you're still trying to just get it working, allocating time to do this is essential to producing high quality code. Despite the initial time and effort required, this really pays off by speeding up your development time in the long run.

Example notebook:

`df` with columns: `fixed acidity`, `volatile acidity`, `citric acid`, `residual sugar`, `chlorides`, `free sulfur dioxide`, `total sulfur dioxide`, `density`, `pH`, `sulphates`, `alcohol`, `quality`


```
def numeric_to_buckets(df, column_name):
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low' 

for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')
```
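One possible refactoring of the loop above, as a rough sketch: the helper returns a new bucketed Series built with `np.where` instead of overwriting `df` row by row. The name `bucket_by_median` and the toy `wine` data are illustrative assumptions, not part of the lesson notebook.

```python
import numpy as np
import pandas as pd


def bucket_by_median(series):
    """Label each value 'high' if it is >= the series median, else 'low'."""
    median = series.median()
    return pd.Series(np.where(series >= median, 'high', 'low'),
                     index=series.index, name=series.name)


# Toy stand-in for the wine data (values are made up for illustration).
wine = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0],
    'quality': [5, 5, 6, 7],
})

buckets = bucket_by_median(wine['alcohol'])
mean_quality = wine.groupby(buckets)['quality'].mean()
print(mean_quality)
```

Because the original column is left untouched, the function can be called repeatedly without changing the data it reads from.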
   
output:
    
    
fixed_acidity


high    5.726061

low     5.540052

Name: quality, dtype: float64 


volatile_acidity

high    5.392157

low     5.890166

Name: quality, dtype: float64 


etc


## Code Efficiency:

reduces time and space in memory

To make computations faster --> use vector operations (numpy) instead of loops

Example:

`np.intersect1d` is faster than looping through two DataFrames, but slower than converting the list/DataFrame/array to a `set` and calling `set.intersection`.

A boolean mask selects the array elements that meet a certain condition without an explicit loop (`np.where(condition)` returns the indices of those elements):

`total_price = gift_costs[gift_costs < 25].sum() * 1.08` instead of looping

`df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)`
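To make the comparison concrete, here is a small sketch that computes the same total with a loop and with a vectorized expression, and finds common elements with both NumPy and plain sets. All sample values and the names `recent_books` and `coding_books` are made up for illustration.

```python
import numpy as np

# Made-up sample data; `gift_costs` mirrors the name used above.
gift_costs = np.array([10.0, 30.0, 20.0, 50.0])

# Loop version: add up every cost below 25, then apply 8% tax.
total = 0.0
for cost in gift_costs:
    if cost < 25:
        total += cost
loop_total = total * 1.08

# Vectorized version: a boolean mask selects the matching elements.
vector_total = gift_costs[gift_costs < 25].sum() * 1.08

# Common elements of two collections, with NumPy and with plain sets.
recent_books = np.array([1, 2, 3, 4])
coding_books = np.array([3, 4, 5])
common_np = np.intersect1d(recent_books, coding_books)
common_set = set(recent_books.tolist()).intersection(coding_books.tolist())
```

Both totals agree; on large arrays the vectorized form is usually much faster because the loop runs inside NumPy rather than in the Python interpreter.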


### Readme-s:

At a minimum, this should explain what it does, list its dependencies, and provide sufficiently detailed instructions on how to use it.

## CI

https://algorithmia.com/blog/how-to-version-control-your-production-machine-learning-models



### Merge Conflicts

https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-merge-conflicts

## Test Driven Development and Data Science


TEST DRIVEN DEVELOPMENT: a development process where you write tests for tasks before you even write the code to implement those tasks.

UNIT TEST: a type of test that covers a “unit” of code, usually a single function, independently from the rest of the program.

To install pytest, run `pip install -U pytest` in your terminal. Create a test file whose name starts with `test_`. Define unit test functions whose names start with `test_` inside the test file. Run `pytest` in your terminal from the directory of your test file and it will detect these tests for you!

`test_` is the default prefix; if you wish to change it, see the pytest configuration documentation. In the test output, periods represent successful unit tests and F's represent failed unit tests. Since all you see is which test functions failed, it's wise to have only one assert statement per test. Otherwise, you wouldn't know exactly how many assertions failed, or which ones.
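A minimal sketch of what such a test file could contain. The function `nearest_square` and the file name `test_compute.py` are assumptions for illustration; in a real project the function under test would normally live in its own module and be imported into the test file.

```python
# Contents of a hypothetical test file, e.g. test_compute.py

def nearest_square(num):
    """Return the largest perfect square that is <= num (0 for negatives)."""
    if num < 0:
        return 0
    root = 0
    while (root + 1) ** 2 <= num:
        root += 1
    return root ** 2


def test_nearest_square_exact():
    assert nearest_square(9) == 9


def test_nearest_square_between_squares():
    assert nearest_square(23) == 16


def test_nearest_square_negative():
    assert nearest_square(-12) == 0
```

Each test holds a single assert, so a failure in the pytest output points at exactly one behavior.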

A failed assert statement does not stop the test run, but a syntax error does.

INTEGRATION TEST: a test that shows that all the parts of a program work with each other properly, communicating and transferring data between them correctly. https://www.fullstackpython.com/integration-testing.html

Resources:

Four Ways Data Science Goes Wrong and How Test Driven Data Analysis Can Help: Blog Post https://www.predictiveanalyticsworld.com/machinelearningtimes/four-ways-data-science-goes-wrong-and-how-test-driven-data-analysis-can-help/6947/


Ned Batchelder: Getting Started Testing: Slide Deck and Presentation Video https://speakerdeck.com/pycon2014/getting-started-testing-by-ned-batchelder



## Logging

- DEBUG: the level you would use for anything that happens in the program.
- ERROR: the level to record any error that occurs.
- INFO: the level to record all actions that are user-driven or system specific, such as regularly scheduled operations.
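A short sketch of these levels using Python's standard `logging` module. The logger name `etl_example` and the messages are made up; the records are written to an in-memory buffer only to keep the example self-contained.

```python
import io
import logging

# Write log records to an in-memory buffer so the example is self-contained;
# a real application would more likely log to a file or to stderr.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))

logger = logging.getLogger('etl_example')   # hypothetical logger name
logger.setLevel(logging.INFO)               # DEBUG records are filtered out
logger.addHandler(handler)
logger.propagate = False

logger.debug('Loaded row 17 of 5000')              # too detailed for INFO level
logger.info('Nightly ETL job started')             # scheduled, system-driven action
logger.error('Failed to connect to the database')  # an error occurred

print(log_buffer.getvalue())
```

With the level set to `INFO`, the debug message is dropped while the info and error messages are kept.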

## Code Reviews

https://www.fullstackpython.com/integration-testing.html

https://github.com/lyst/MakingLyst/tree/master/code-reviews



Is the code clean and modular?
- Can I understand the code easily?
- Does it use meaningful names and whitespace?
- Is there duplicated code?
- Can you provide another layer of abstraction?
- Is each function and module necessary?
- Is each function or module too long?

Is the code efficient?
- Are there loops or other steps we can vectorize?
- Can we use better data structures to optimize any steps?
- Can we reduce the number of calculations needed for any steps?
- Can we use generators or multiprocessing to optimize any steps?

Is documentation effective?
- Are in-line comments concise and meaningful?
- Is there complex code that's missing documentation?
- Do functions use effective docstrings?
- Is the necessary project documentation provided?

Is the code well tested?
- Does the code have high test coverage?
- Do tests check for interesting cases?
- Are the tests readable?
- Can the tests be made more efficient?

Is the logging effective?
- Are log messages clear, concise, and professional?
- Do they include all relevant and useful information?
- Do they use the appropriate logging level?


Using linters: https://www.pylint.org/



# 2. Object Oriented Programming

## (Project A: Create and Upload a Python Package to Pypi)
## (Project B: Develop a Data Dashboard Using Flask, Plotly and Pandas)
## (Project C: Deploy a Data Dashboard)

https://www.youtube.com/watch?v=Y8ZVw1LHI8E&feature=emb_logo

https://www.youtube.com/watch?time_continue=279&v=NcgDIWm6iBA&feature=emb_logo

- class - a blueprint consisting of methods and attributes
- object - an instance of a class. It can help to think of objects as something in the real world like a yellow pencil, a small dog, a blue shirt, etc. However, as you'll see later in the lesson, objects can be more abstract.
- attribute - a descriptor or characteristic. Examples would be color, length, size, etc. These attributes can take on specific values like blue, 3 inches, large, etc.
- method - an action that a class or object could take
- OOP - a commonly used abbreviation for object-oriented programming
- encapsulation - one of the fundamental ideas behind object-oriented programming is called encapsulation: you can combine functions and data all into a single entity. In object-oriented programming, this single entity is called a class. Encapsulation allows you to hide implementation details much like how the scikit-learn package hides the implementation of machine learning algorithms.



Function vs method --> a method is a function defined inside a class and called on an object; like any other function, it may or may not return a value





A get method is for obtaining an attribute value. A set method is for changing an attribute value. If you were writing a Shirt class, the code could look like this:

```
class Shirt:

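    def __init__(self, shirt_color, shirt_size, shirt_style, shirt_price):
        self.color = shirt_color
        self.size = shirt_size
        self.style = shirt_style
        self._price = shirt_price   # leading underscore: use the get/set methods

    def get_price(self):
        return self._price

    def set_price(self, new_price):
        self._price = new_price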
```

This replaces code that reads and changes attributes directly, such as:

```
shirt_one.price = 10
shirt_one.price = 20
shirt_one.color = 'red'
shirt_one.size = 'M'
shirt_one.style = 'long_sleeve'
```


Instantiating and using an object might look like this:

```
shirt_one = Shirt('yellow', 'M', 'long-sleeve', 15)
print(shirt_one.get_price())
shirt_one.set_price(10)
```


Following the Python convention, the underscore in front of price lets a programmer know that price should only be accessed with the get and set methods rather than directly with `shirt_one._price`. However, a programmer could still access `_price` directly because nothing in the Python language prevents direct access. To reiterate, a programmer could technically still do something like `shirt_one._price = 10`, and the code would work. But accessing price directly, in this case, would not follow the intent of how the Shirt class was designed.

Also, if you were developing a software program, you would want to modularize this code.

You would put the Shirt class into its own Python script called, say, shirt.py. And then in another Python script, you would import the Shirt class with a line like: from shirt import Shirt.

    def display_sales(self):
        """The display_sales method prints out all pants that have been sold

        Args: None

        Returns: None

        """

        for pants in self.pants_sold:
            print('color: {}, waist_size: {}, length: {}, price: {}'\
                  .format(pants.color, pants.waist_size, pants.length, pants.price))
    
    def calculate_sales(self):
        """The calculate_sales method sums the total price of all pants sold

        Args: None

        Returns:
            float: sum of the price for all pants sold
        
        """

        total = 0
        for pants in self.pants_sold:
            total += pants.price
            
        self.total_sales = total
        
        return total
    
    def calculate_commission(self, percentage):
        """The calculate_commission method outputs the commission based on sales

        Args:
            percentage (float): the commission percentage as a decimal

        Returns:
            float: the commission due
        """

        sales_total = self.calculate_sales()
        return sales_total * percentage 

## 2.1 Magic methods

`__add__` changes the behavior of the plus sign

`__repr__` defines the string shown when an object is echoed in the interpreter or passed to `repr()`
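A small sketch of both magic methods on a toy class. `Money` is a made-up example, not a class from the lesson.

```python
class Money:
    """Toy class (not from the lesson) showing two magic methods."""

    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        # Called for `self + other`; returns a new Money object.
        return Money(self.amount + other.amount)

    def __repr__(self):
        # Called by repr() and when the object is echoed in a REPL.
        return 'Money({})'.format(self.amount)


total = Money(10) + Money(5)
print(total)
```

Since the class defines no `__str__`, `print` falls back to `__repr__` as well.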


## 2.2 

## 2.3 Multiple Inheritance, Mixins

https://easyaspython.com/mixins-for-fun-and-profit-cb9962760556

## 2.4 Decorators

https://realpython.com/primer-on-python-decorators/


## 2.5 Organizing into Modules

module = reusable file

package = collection of modules placed into a directory; the directory also needs an `__init__.py` file.

`{name_of_the_package}.__file__` gives the full path where an installed package lives on your machine
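For instance, with the standard-library `json` package (chosen here only because it is always installed):

```python
import json

# A package loaded from disk exposes the path of its __init__.py as __file__.
print(json.__file__)
```

The exact path printed depends on where Python is installed on the machine.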


# 3 Data Engineering

In this project you're going to be analyzing thousands of real messages provided by Figure 8, sent during natural disasters either via social media or directly to disaster response organizations. You'll build an ETL pipeline that processes message and category data from csv files and load them into a SQLite database, which your machine learning pipeline will then read from to create and save a multi-output supervised learning model. Then, your web app will extract data from this database to provide data visualizations and use your model to classify new messages for 36 categories.

Machine learning is critical to helping different organizations understand which messages are relevant to them and which messages to prioritize. During these disasters is when they have the least capacity to filter out messages that matter, and find basic methods such as using key word searches to provide trivial results. In this course, you'll learn the skills you need in ETL pipelines, natural language processing, and machine learning pipelines to create an amazing project with real world significance.


## 3.1 ETL / ELT

## 3.2 Encoding

This link has a list of encodings that Python recognizes https://docs.python.org/3/library/codecs.html#standard-encodings

OR 

```
!pip install chardet

import chardet

# Read the raw bytes and let chardet guess the encoding
with open("mystery.csv", 'rb') as file:
    print(chardet.detect(file.read()))
```

## 3.3 Missing values

There are also implementations of some machine learning algorithms, such as gradient boosting decision trees that can handle missing values https://xgboost.readthedocs.io/en/latest/

Options to handle missing data:

- delete
- impute (replace with the mean or median, for example)
- forward fill, backward fill (with time related data)
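The options above can be sketched with pandas on a made-up Series; the name `readings` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up time series with two gaps.
readings = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = readings.dropna()                      # delete rows with missing values
mean_filled = readings.fillna(readings.mean())   # impute with the mean (3.0 here)
forward = readings.ffill()                       # carry the last known value forward
backward = readings.bfill()                      # pull the next known value backward
```

Forward and backward fill only make sense when the row order is meaningful, as it is with time-related data.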


## 3.4 Scaling Data

Some ML algorithms work better with scaled data (PCA, LR..)

normalization/feature scaling: rescaling (numerical data on a scale from 0 to 1) or standardization (mean = 0, stdev = 1)
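Both transformations can be written in a line each with NumPy. This is a sketch on made-up values; libraries such as scikit-learn provide ready-made scalers for the same purpose.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])   # made-up feature column

# Rescaling (min-max): the smallest value maps to 0, the largest to 1.
rescaled = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()
```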


## 3.5 Feature Engineering

useful if models are underfitting


## 3.6 Outliers

Statistical Methods for Outlier Detection: z-scores, Tukey Method (mean, stdev, quartiles..). For example:


Use the Tukey rule to determine what values of the population data are outliers for the year 2016. The Tukey rule finds outliers in one-dimension. The steps are:
```
Find the first quartile (ie .25 quantile)
Find the third quartile (ie .75 quantile)
Calculate the inter-quartile range (Q3 - Q1)
Any value that is greater than Q3 + 1.5 * IQR is an outlier
Any value that is less than Q1 - 1.5 * IQR is an outlier

Calculate the maximum and minimum values according to the Tukey rule: max_value is Q3 + 1.5 * IQR while min_value is Q1 - 1.5 * IQR


max_value = 
min_value = 


TODO: filter the population_2016 data for population values that are greater than max_value or less than min_value
```
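The steps above can be sketched with NumPy as follows. The array `population` and its values are made up and stand in for the exercise's `population_2016` data, whose exact layout is not shown in these notes.

```python
import numpy as np

# Made-up values; 100.0 is the planted outlier.
population = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

q1 = np.percentile(population, 25)   # first quartile
q3 = np.percentile(population, 75)   # third quartile
iqr = q3 - q1                        # inter-quartile range

max_value = q3 + 1.5 * iqr
min_value = q1 - 1.5 * iqr

# Keep only the values outside the [min_value, max_value] range.
outliers = population[(population > max_value) | (population < min_value)]
```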

It is possible to project 3-D data down to 2-D with PCA so that outliers are easier to spot visually (for example, against a line or parabola), or to cluster the data and measure the distance of each point from its cluster centroid.

https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561

https://scikit-learn.org/stable/modules/outlier_detection.html


ONLY remove if the outlier affects model performance


# 3 Submit a package

Packaging and distributing projects

### Udacity video class is here: https://www.youtube.com/watch?time_continue=394&v=4uosDOKn5LI&feature=emb_logo



https://packaging.python.org/tutorials/packaging-projects/


Github contribution:

https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/

https://github.com/MarcDiethelm/contributing/blob/master/README.md


### PyPi vs. Test PyPi

Note that pypi.org and test.pypi.org are two different websites. You'll need to register separately at each website. If you only register at pypi.org, you will not be able to upload to the test.pypi.org repository.

Also, remember that your package name must be unique. If you use a package name that is already taken, you will get an error when trying to upload the package.

Summary of the Terminal Commands Used

```
# navigate to the directory that contains your package and setup.py file
cd binomial_package_files

# build a source distribution: creates a dist folder with the .tar.gz file to submit to PyPI
python setup.py sdist

# install twine, the tool that uploads the package
pip install twine

# commands to upload to the pypi test repository
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
pip install --index-url https://test.pypi.org/simple/ dsnd-probability


# command to upload to the pypi repository
twine upload dist/*
pip install dsnd-probability
```



Then, ideally, create a separate venv and install the package as usual. I had to install my oop_guessing_game package as follows:

`pip install oop_guessing_game`

and import it as below:

`from oop_guessing_game.oop_guessing_game import ComputerPlayer, PersonPlayer, Game`



More PyPi Resources

Tutorial on distributing packages https://packaging.python.org/tutorials/packaging-projects/

This link has a good tutorial on distributing Python packages, including more configuration options for your setup.py file. You'll notice that the python command to run setup.py is slightly different:

`python3 setup.py sdist bdist_wheel`

This command will still output a folder called dist. The difference is that you will get both a .tar.gz file and a .whl file. The .tar.gz file is called a source archive whereas the .whl file is a built distribution. The .whl file is a newer type of installation file for Python packages. When you pip install a package, pip will first look for a whl file (wheel file) and if there isn't one, will then look for the tar.gz file.

A tar.gz file, ie an sdist, contains the files needed to compile and install a Python package. A whl file, ie a built distribution, only needs to be copied to the proper place for installation. Behind the scenes, pip installing a whl file has fewer steps than a tar.gz file.

Other than this command, the rest of the steps for uploading to PyPi are the same.



# 4 Web Development

## (Develop and Deploy Dashboard Project)


- setting up the backend
- linking the backend and the frontend together
- deploying the app to a server so that the app is available from a web address


Why Bootstrap?
Bootstrap is one of the easier front-end frameworks to work with. Bootstrap largely eliminates the need to write your own CSS or JavaScript. Instead, you style your pages by adding Bootstrap's classes to your HTML. You'll be able to design sleek, modern looking websites more quickly than if you were coding the CSS and JavaScript directly.

https://getbootstrap.com/docs/4.0/getting-started/introduction/#starter-template

https://getbootstrap.com/docs/4.0/layout/grid/

https://getbootstrap.com/docs/4.0/layout/overview/

https://getbootstrap.com/docs/4.0/content/images/

https://getbootstrap.com/docs/4.0/components/navbar/

https://getbootstrap.com/docs/4.0/utilities/colors/



### 4.1 Javascript

jQuery came out in 2006. There are newer JavaScript tools out there like React and Angular.


Plain JavaScript can be difficult to write, so jQuery is used instead; you just need to reference it in the code by including:

```
<script
src = "https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js">
</script>
```



Basic JavaScript Syntax

Here are a few rules to keep in mind when writing JavaScript:

- a line of code ends with a semi-colon ;
- () parenthesis are used when calling a function much like in Python
- {} curly braces surround blocks of code or are used when initializing objects (similar to Python dictionaries)
- [] square brackets are used for accessing values from arrays or objects, much like lists and dictionaries in Python


Here is an example of a JavaScript function that sums the elements of an array.

```
function addValues(x) {
  var sum_array = 0;
  for (var i=0; i < x.length; i++) {   
    sum_array += x[i];
  }
  return sum_array;
}

addValues([3,4,5,6]);
```

### 4.2 Flask (BE Deployment of Web App)

Flask is a web framework. A web framework takes care of all the routing needed to organize a web page so that you don't have to write the code yourself!


#### Creating New Pages

To create a new web page, you first need to specify the route in the routes.py as well as the name of the html template.

```
@app.route('/new-route')
def render_the_route():
    return render_template('new_route.html')
```

The route name, function name, and template name do not have to match; however, it's good practice to make them similar so that the code is easier to follow.

The `new_route.html` file must go in the templates folder. Flask automatically looks for html files in the templates folder.

What is `@app.route`? --> The path you place inside `@app.route()` will be the web address, and the function you write below `@app.route` renders the correct html template file for that address.

In Python, the @ symbol is used for decorators. Decorators are a shorthand way to pass a function into another function. Take a look at this code. Python allows you to use a function as an input to another function:

```
def decorator(input_function):

    return input_function

def input_function():
    print("I am an input function")

decorator_example = decorator(input_function)
decorator_example()
```

Running this code will print the string:

`I am an input function`

Decorators provide a short-hand way of getting the same behavior:
```
def decorator(input_function):
    print("Decorator function")
    return input_function

@decorator
def input_function():
    print("I am an input function")

input_function()
```

This code will print out:

```
Decorator function
I am an input function
```

Instead of using the `@decorator` syntax, you could get the same behavior by defining `input_function` without it and then running the following code:

```
input_function = decorator(input_function)
input_function()
```

Because `@app.route()` has the . symbol, there's an implication that app is a class (or an instance of a class) and route is a method of that class. Hence a function written underneath @app.route() is going to get passed into the route method. The purpose of @app.route() is to make sure the correct web address gets associated with the correct html template. This code

```
@app.route('/homepage')
def some_function():
    return render_template('index.html')
```

is ensuring that the web address 'www.website.com/homepage' is associated with the index.html template.
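To see how a method like `route` can associate an address with a function, here is a stripped-down sketch. `ToyApp` is a made-up stand-in written for this note; it is not Flask's actual implementation, and the returned string merely stands in for a rendered template.

```python
class ToyApp:
    """Stripped-down stand-in for a web framework app (not Flask's real code)."""

    def __init__(self):
        self.routes = {}   # maps a URL path to the function that handles it

    def route(self, path):
        # route() returns the actual decorator, which registers the function.
        def decorator(view_function):
            self.routes[path] = view_function
            return view_function
        return decorator

    def serve(self, path):
        # Look up the function registered for this path and call it.
        return self.routes[path]()


app = ToyApp()


@app.route('/homepage')
def some_function():
    return 'contents of index.html'


print(app.serve('/homepage'))
```

The decorator stores the function in a dictionary keyed by the path, which is the essence of how an address gets tied to the code that produces its page.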

If you'd like to know more details about decorators and how @app.route() works, check out these tutorials:


- https://realpython.com/primer-on-python-decorators/
- https://ains.co/blog/things-which-arent-magic-flask-part-1.html




### 4.3 Bring BE operations to the FE 

Using Flask + Jinja (Template Engine)


In the lesson video, the data set was sent from the back end to the front end. This was accomplished by including a variable in the render_template() function like so:

```
data = data_wrangling()

@app.route('/')
@app.route('/index')
def index():
   return render_template('index.html', data_set = data)
```
   
What this code does is to first load the data using the data_wrangling function from wrangling.py. This data gets stored in a variable called data.

In render_template, that data is sent to the front end via a variable called data_set. Now the data is available to the front_end in the data_set variable.

In the index.html file, you can access the data_set variable using the following syntax:

```
{{ data_set }}
```

You can do this because Flask comes with a template engine Jinja. Jinja also allows you to put control flow statements in your html using the following syntax:

```
{% for tuple in data_set %}
  <p>{{tuple}}</p>
{% endfor %}
```

The logic is:

- Wrangle data in a file (aka Python module). In this case, the file is called wrangling.py. The wrangling.py has a function that returns the clean data.

- Execute this function in routes.py to get the data in routes.py

- Pass the data to the front-end (index.html file) using the render_template method.

- Inside of index.html, you can access the data variable with the squiggly bracket syntax {{ }}
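The template syntax can be tried outside of Flask with the `jinja2` package directly. This is a sketch with made-up rows standing in for the wrangled data; note that Jinja closes a loop with `{% endfor %}`.

```python
from jinja2 import Template

# Made-up rows standing in for the wrangled data set.
data_set = [('Canada', 37), ('Brazil', 209)]

template = Template(
    "{% for row in data_set %}<p>{{ row[0] }}: {{ row[1] }}</p>{% endfor %}"
)
html = template.render(data_set=data_set)
print(html)
```

In a Flask app, `render_template('index.html', data_set=data)` does the equivalent with a template file from the templates folder.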





### 4.4 Web App Deployment

#### Other Services Besides Heroku

Heroku is just one option of many for deploying a web app; it is owned by Salesforce.

The big internet companies offer similar services like Amazon's Lightsail, Microsoft's Azure, Google Cloud, and IBM Cloud (formerly IBM Bluemix). However, these services tend to require more configuration. Most of these also come with either a free tier or a limited free tier that expires after a certain amount of time.

#### Instructions Deploying from the Classroom

Here is the code used in the screencast to get the web app running:

First, a new folder was created for the web app and all of the web app folders and files were moved into the folder:

```
mkdir web_app
mv -t web_app data worldbankapp wrangling_scripts worldbank.py
```

The next step was to create a virtual environment and then activate the environment:

```
conda update python
python3 -m venv worldbankvenv
source worldbankvenv/bin/activate
```

Then, pip install the Python libraries needed for the web app

```
pip install flask pandas plotly gunicorn

# The next step was to install the heroku command line tools
# (see https://devcenter.heroku.com/articles/heroku-cli#standalone-installation):
curl https://cli-assets.heroku.com/install-ubuntu.sh | sh
heroku --version
```

And then log into heroku with the following command

```
heroku login
```

Heroku asks for your account email address and password, which you type into the terminal and press enter.

The next steps involved some housekeeping:

```
remove app.run() from worldbank.py
```

Type `cd web_app` into the terminal so that you are inside the folder with your web app code.

Then create a proc file, which tells Heroku what to do when starting your web app:

`touch Procfile`

Then open the Procfile and type:

```
web: gunicorn worldbank:app
```

Next, create a requirements file, which lists all of the Python libraries that your app depends on:

`pip freeze > requirements.txt`

And initialize a git repository and make a commit:

```
git init
git add .
git commit -m 'first commit'
```

Now, create a heroku app:

`heroku create my-app-name`


where my-app-name is a unique name that nobody else on Heroku has already used.

The heroku create command should create a git repository on Heroku and a web address for accessing your web app. You can check that a remote repository was added to your git repository with the following terminal command:

```
git remote -v
```

Next, you need to push your git repository to the remote heroku repository with this command:

`git push heroku master`

Now, you can type your web app's address in the browser to see the results.

When installing a web app to a server, you should only include the packages that are necessary for running your web app. Otherwise you'd be installing Python packages that you don't need.

To ensure that your app only installs necessary packages, you should create a virtual Python environment. A virtual Python environment is a separate Python installation on your computer that you can easily remove and won't interfere with your main Python installation.

There is more than one Python package that can set up virtual environments. In the past, you had to install these packages yourself. With Python 3.6, there is a virtual environment package that comes with the Python installation. The package is called venv.

However, there is a bug with anaconda's 3.6 Python installation on a Linux system. So in order to use venv in the workspace classroom, you first need to update the Python installation as shown in the instructions above.

#### Creating a Virtual Environment Locally on Your Computer

You can develop your app using the classroom workspace. If you decide to develop your app locally on your computer, you should set up a virtual environment there as well. Different versions of Python have different ways of setting up virtual environments. Assuming you are using Python 3.6 and are on a linux or macOS system, then you should be able to set up a virtual environment on your local machine just by typing:

`python3 -m venv name`

and then to activate:

`source name/bin/activate`

#### Databases for Your App

The web app in this lesson does not need a database. All of the data is stored in CSV files; however, it is possible to include a database as part of a Flask app. One common use case would be to store user login information such as username and password.

Flask is database agnostic meaning Flask can work with a number of different database types. If you are interested in learning about how to include a database as part of a Flask app, here are some resources:

Flask Mega Tutorial: https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-iv-database


Heroku - Provision a Database: https://devcenter.heroku.com/articles/getting-started-with-python#provision-a-database

# NLP
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

- Text Processing: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
- Feature Extraction: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
- Modeling: Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.


### Data Preparation:

- Cleaning to remove irrelevant items, such as HTML tags (regular expression)
- Normalizing by converting to all lowercase and removing punctuation (`.lower()`)
- Splitting text into words or tokens (`.split()`, or `from nltk.tokenize import word_tokenize`, which handles more of the cleaning; `sent_tokenize` from the same package splits text into sentences, one sentence being one token)
- Removing words that are too common, also known as stop words: `from nltk.corpus import stopwords`, and then `words = [w for w in words if w not in stopwords.words('english')]`
- Identifying different parts of speech and named entities (`from nltk import pos_tag`)
- Converting words into their dictionary forms, using stemming and lemmatization (`[PorterStemmer().stem(w) for w in words]` and `[WordNetLemmatizer().lemmatize(w) for w in words]`)
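The first few steps can be sketched with the standard library alone. This is a simplification: the stop word set below is a tiny hand-picked assumption (NLTK's list is much longer), and stemming, lemmatization, and part-of-speech tagging are left out.

```python
import re

# Tiny hand-picked stop word list for illustration; NLTK ships a much longer one.
STOP_WORDS = {'the', 'is', 'a', 'of', 'and', 'in'}


def tokenize(text):
    """Clean, normalize, and split raw text into a list of tokens."""
    text = re.sub(r'<[^>]+>', ' ', text)          # drop HTML tags
    text = text.lower()                           # normalize case
    text = re.sub(r'[^a-z0-9]', ' ', text)        # replace punctuation with spaces
    tokens = text.split()                         # split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]


tokens = tokenize('<p>The first time you see The Second Renaissance, '
                  'it may look boring.</p>')
print(tokens)
```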


## Feature Extraction

Always need some sort of numerical representation, depending on the task:

1. document level task (spam detection or sentiment analysis): bag of words or doc2vec. Bag of Words treats a text as an unordered set of words and turns it into a vector of numbers: collect the vocabulary from all the documents being compared, make each word one column and each document one row, and count how often each word occurs in each document. The result is a document-term matrix in which each number is the frequency of a word in a document. The greater the dot product of two document vectors, the greater their similarity. However, this method treats all words as equally important. To avoid this, we can also count the occurrences of each word across all documents together and then calculate what proportion of them falls in a specific document (e.g. 1/3 of the occurrences of the word DOG are in document number 1). This way rare words that appear only once also stand out. This is the idea behind TF-IDF.






2. individual words and phrases (text generation, machine translation): word level representation such as word2vec or GloVe.
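A standard-library sketch of a document-term matrix and one common TF-IDF variant (term frequency times the log of the inverse document frequency). The three documents are made up, and libraries such as scikit-learn use slightly different weighting formulas.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents.
docs = [
    ['dog', 'bites', 'man'],
    ['man', 'bites', 'dog', 'dog'],
    ['cat', 'sleeps'],
]

vocabulary = sorted({word for doc in docs for word in doc})

# Document-term matrix: one row per document, one column per vocabulary word.
counts = [Counter(doc) for doc in docs]
doc_term = [[c[word] for word in vocabulary] for c in counts]


def tf_idf(word, doc_index):
    """Term frequency in one document times log inverse document frequency."""
    tf = counts[doc_index][word] / len(docs[doc_index])
    docs_with_word = sum(1 for c in counts if word in c)
    return tf * math.log(len(docs) / docs_with_word)
```

A word that occurs in only one document, such as `cat`, gets a higher weight than a word spread across several documents, such as `dog`.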

## ML Pipelines