# <font color='SEAGREEN'>Day 1</font>
# <font color='MEDIUMSEAGREEN'>(1) Fundamentals of Python Programming for Text Classification</font>
Python is a popular language because it is open source, easy to learn, and supported by many popular libraries.
There are two major versions of Python: Python 2.x and Python 3.x. Python 2 reached its end of life on January 1, 2020 and no longer receives updates, so this course uses Python 3.

The main strength of Python for machine learning and data science is the abundance of dependable
libraries. There are packages for scientific computing (NumPy and SciPy), machine learning
(scikit-learn), image processing (scikit-image), computer vision (OpenCV), data visualization
(Matplotlib) and deep learning (PyTorch, TensorFlow, MXNet, etc.). Most of these libraries can
be installed with a single **pip** command, which makes setup very easy.

## Jupyter Notebook

Jupyter divides a notebook into neat units called cells. There are two main types of cell:
- **Markdown cells** contain text, like this cell. Double-click them to edit the text, and then press **< Shift > + < Enter >** to render it with formatting. Try double-clicking this text!
- **Code cells** contain executable Python code. Click them to edit the code, then press **< Shift > + < Enter >** to execute the block. Try it out below!


In [1]:
print("Hi, I'm a code cell! Click me and press shift + enter.")

Hi, I'm a code cell! Click me and press shift + enter.


You can add new cells using the plus sign on the menu above. Take a few minutes to look at the different options offered in the menu! There are a few that are especially useful: 

- File Menu 
    - "Download As" lets you save your notebook to your computer as a .ipynb file
- Kernel Menu
    - "Restart" restarts the kernel, which clears all variables ("Restart & Clear Output" also clears the output of every cell). Can be useful if your code gets stuck!
- Cell Menu
    - "Cell Type" lets you change the type of cell you are working with
    
## Python Refresher
First, let's do some exercises to refresh our memory of a few Python concepts.
### Python objects, basic types, and variables
Everything in Python is an **object** and every object in Python has a **type**. Some of the basic types include:

- **`int`** (integer; a whole number with no decimal place)
- **`float`** (float; a number that has a decimal place)
- **`str`** (string; a sequence of characters enclosed in single quotes, double quotes, or triple quotes)
- **`bool`** (boolean; a binary value that is either `True` or `False`)
- **`NoneType`** (a special type representing the absence of a value)

In Python, a **variable** is a name you specify in your code that represents a specific **instance** of an object.

Defining variable names helps you remember what an object is supposed to represent or what you want to do with that object (so pick meaningful names!). Variables also allow you a lot of flexibility, letting you modify their value without having to know exactly what the new value should be. You'll see this later. 
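
As a quick illustration of the types listed above, the built-in `type()` function reports the type of any object. This is only a small sketch; the variable names are made up for the example.

```python
# Hypothetical variable names, chosen only for illustration
num_bottles = 99      # int
price = 1.5           # float
label = "pop"         # str
is_empty = False      # bool
nothing = None        # NoneType

for value in [num_bottles, price, label, is_empty, nothing]:
    print(repr(value), "is of type", type(value).__name__)

# A variable can be updated from its current value,
# without knowing in advance what the new value will be
num_bottles = num_bottles - 1
print(num_bottles)    # 98
```
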
<hr>



In [2]:
# Take a guess!
some_num1 = 1
some_num2 = 4
(some_num1 + some_num2) * some_num2

20

20

In [3]:
# What type of variable will the result be?
some_num1 + some_num2 == 5

bool

True

In [None]:
# What might this do?
simple_string1 = 'an example '
simple_string2 = "of strings "
simple_string1 + simple_string2

It joins the two strings into a new one: 'an example of strings '

In [None]:
# Important! Notice that the string was not modified
simple_string1

In [4]:
# Are these two expressions equal to each other?
simple_string1 == simple_string2

No

False

In [None]:
# Add and re-assign
simple_string1 += 'that re-assigned the original string'
simple_string1

### Basic containers

> Note: **mutable** objects can be modified after creation and **immutable** objects cannot.

Containers are objects that can be used to group other objects together. Some useful container types are:

- **`str`** (string: immutable; indexed by integers; items are stored in the order they were added)
- **`list`** (list: mutable; indexed by integers; items are stored in the order they were added)
  - `[3, 5, 6, 3, 'dog', 'cat', False]`
- **`tuple`** (tuple: immutable; indexed by integers; items are stored in the order they were added)
  - `(3, 5, 6, 3, 'dog', 'cat', False)`
- **`set`** (set: mutable; not indexed at all; items are NOT stored in the order they were added; can only contain immutable objects; does NOT contain duplicate objects)
  - `{3, 5, 6, 3, 'dog', 'cat', False}`
- **`dict`** (dictionary: mutable; key-value pairs are indexed by immutable keys; since Python 3.7, items are kept in the order they were added)
  - `{'name': 'Jane', 'age': 23, 'fav_foods': ['pizza', 'fruit', 'fish']}`

When defining lists, tuples, or sets, use commas (,) to separate the individual items. When defining dicts, use a colon (:) to separate keys from values and commas (,) to separate the key-value pairs.

Strings, lists, and tuples can use all the `+`, `*`, `+=`, and `*=` operators. 

You can access items in lists and tuples by using the index of the value, and you can modify items in lists the same way (tuples are immutable, so their items cannot be changed).
- list[0] is the first value in a list (indexing in Python starts at 0!).

You can modify items in a dictionary by using the key for that item:
- dict[key] is the dictionary item with key "key"
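
The container properties described above can be checked directly. The following is a minimal sketch with made-up example values, not part of the exercises:

```python
# Hypothetical example values
pets = ['dog', 'cat', 'fish']          # list: mutable
pets[0] = 'bird'                       # replacing an item works
print(pets)                            # ['bird', 'cat', 'fish']

point = (3, 5)                         # tuple: immutable
try:
    point[0] = 10                      # raises TypeError
except TypeError as err:
    print("Cannot modify a tuple:", err)

unique = {3, 5, 6, 3}                  # set: the duplicate 3 is dropped
print(len(unique))                     # 3

person = {'name': 'Jane', 'age': 23}   # dict: items are accessed by key
person['age'] = 24
print(person['age'])                   # 24

print([1, 2] + [3])                    # [1, 2, 3]
print('ab' * 2)                        # abab
```
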

In [None]:
# Assign some containers to different variables
list1 = [99, "Bottles ", "of pop ", "on the wall", "pepsi"]
tuple1 = (99, "Bottles ", "of pop")
dict1 = {'Number of bottles of pop on the wall': 98}

In [None]:
# Items in the list object are stored in the order they were added
list1

In [None]:
# Items in the tuple object are stored in the order they were added
tuple1

In [None]:
# Since Python 3.7, items in the dict object are also kept in the order they were added
dict1

In [None]:
# You can change a list item
list1[0] = 98
list1[1] = 200
# But you CAN'T change a tuple item

In [None]:
# Re-assign a dict item
dict1["Number of bottles of pop on the wall"] = 96
dict1["Number of bottles of pop on the wall"] = 20

### Functions
A Python function is written like this:

```python
def square(x):
    return x**2
```
    
The name of the function is square, x is the input variable, and the return keyword tells us what to give as output.
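
To use a function, call it by name and pass a value for each input variable. A short sketch that repeats the `square` definition so it runs on its own:

```python
def square(x):
    return x**2

result = square(4)     # call the function with x = 4
print(result)          # 16
print(square(2.5))     # 6.25
```
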

In [5]:
# Exercise 1. Define a function called weight_conversion that takes one variable (x) in pounds,
# uses the conversion formula kilograms = pounds/2.2, and returns the weight in kilograms.

#### YOUR CODE STARTS HERE ####
def weight_conversion(x):
    return x/2.2
#### YOUR CODE ENDS HERE ####

print("Testing:")
for x in [10, 22, 180, 0]:
    print(str(x), " -> ", str(weight_conversion(x)))
    print("CORRECT" if weight_conversion(x)==(x/2.2) else "INCORRECT")

Testing:
10  ->  4.545454545454545
CORRECT
22  ->  10.0
CORRECT
180  ->  81.81818181818181
CORRECT
0  ->  0.0
CORRECT


### If-else statements

An if/else statement looks like this:

```python
if electoral_votes >= 270:
    print("You win the election")
else:
    print("You lose the election")
```

The condition of the `if` statement is evaluated (`electoral_votes >= 270`); if it is true, the code under the `if` is executed, and if it is false, the code under the `else` is executed.

In [8]:
# Exercise 2. Define a function called "contains_ss" that takes one variable (word) 
# and returns True if the word contains a double-s and False if it doesn't.
# Hint: to test whether a string e.g. "ss" is inside another string variable e.g. word, write
#    if "ss" in word: 

#### YOUR CODE STARTS HERE ####
def contains_ss(word):
    if "ss" in word:
        return True
    else:
        return False

#### YOUR CODE ENDS HERE ####

print("Testing:")
for word in ["computer", "science", "lesson"]:
    print("{:s} -> {}".format(word, contains_ss(word)), end=' ')
    print("CORRECT" if contains_ss(word)==("ss" in word) else "INCORRECT")

Testing:
computer -> False CORRECT
science -> False CORRECT
lesson -> True CORRECT


### More complex if-else statements

Maybe you want to check *several* conditions? You can use an if/elif/else statement.

```python
if teamA_score > teamB_score:
    print("Team A wins")
elif teamA_score < teamB_score:
    print("Team B wins")
else:
    print("It's a tie!")
```

`elif` stands for "else if". In fact, the above code is just a neater way of writing this:
```python
if teamA_score > teamB_score:
    print("Team A wins")
else:
    if teamA_score < teamB_score:
        print("Team B wins")
    else:
        print("It's a tie!")
```

You can have as many `elif` clauses as you like. They are useful when you want to choose between several options.
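
The snippets above assume that `teamA_score` and `teamB_score` already exist. One way to try them out is to wrap the logic in a function; the name `match_result` is made up for this sketch.

```python
def match_result(teamA_score, teamB_score):
    # Conditions are checked from top to bottom; only the first true branch runs
    if teamA_score > teamB_score:
        return "Team A wins"
    elif teamA_score < teamB_score:
        return "Team B wins"
    else:
        return "It's a tie!"

print(match_result(3, 1))   # Team A wins
print(match_result(0, 2))   # Team B wins
print(match_result(1, 1))   # It's a tie!
```
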

In [10]:
# Exercise 3. Define a function called "grade" that takes one input (score).
# If score >= 90, return the string "A"
# Otherwise, if score >= 80, return the string "B"
# Otherwise, if score >= 70, return the string "C"
# Otherwise, if score >= 60, return the string "D"
# Otherwise, if score >= 50, return the string "E"
# Otherwise, return the string "F"

#### YOUR CODE STARTS HERE ####
def grade(score):
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    elif score >= 50:
        return "E"
    else:
        return "F"

#### YOUR CODE ENDS HERE ####


print("Testing:")
for (score,g) in [(77,"C"),(80,"B"),(32,"F"),(100,"A"),(69,"D")]:
    print("{:d} -> {:s}".format(score, grade(score)), end=' ')
    print("CORRECT" if grade(score)==g else "INCORRECT")

Testing:
77 -> C CORRECT
80 -> B CORRECT
32 -> F CORRECT
100 -> A CORRECT
69 -> D CORRECT


### Loops

Loops are a useful tool that lets you reuse blocks of code without having to retype everything. There are several kinds of loops. The difference between them is what controls how many times they run: 
- **While** loops run as long as the condition at the top is True. That means they can run forever if you're not careful!
- **For** loops run for as many times as there are objects in the container at the top of the loop.

In [None]:
# An example of a while loop. What is the condition? What would happen if you didn't subtract one from index each time?
index = 5
while index > 0:
    print(index)
    index -= 1
    
The condition is `index > 0`. If you didn't subtract one from `index` each time, the condition would stay True and the loop would go on forever.

In [None]:
# An example of a for loop. What is the collection?
my_list = [5, 4, 3, 2, 1]
for number in my_list:
    print(number)
    
The collection is `my_list`, a list of the numbers 5, 4, 3, 2, 1.

A handy trick with loops is the range function. Say you want your loop to run 5 times, but you don't want to have to make a list with 5 numbers. You can do this:

In [None]:
# The syntax for the range function is range(start, stop, step). The stop value itself is not included.
# If you only include one argument, the function will start at 0 and go up to (but not including) that number, increasing by 1 each time.
for number in range(5, 0, -1):
    print(number)
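
To see exactly which numbers a `range` produces, you can convert it to a list. A short sketch of the common forms:

```python
print(list(range(5)))          # [0, 1, 2, 3, 4]
print(list(range(2, 6)))       # [2, 3, 4, 5]
print(list(range(5, 0, -1)))   # [5, 4, 3, 2, 1]
print(list(range(0, 10, 3)))   # [0, 3, 6, 9]
```
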

### Python built-in functions

A **function** is a Python object that you can "call" to **perform an action** or compute and **return another object**. You call a function by placing parentheses to the right of the function name. Some functions allow you to pass **arguments** inside the parentheses (separating multiple arguments with a comma). Internal to the function, these arguments are treated like variables.

Python has several useful built-in functions to help you work with different objects and/or your environment. Here is a small sample of them:

- **`type(obj)`** to determine the type of an object
- **`len(container)`** to determine how many items are in a container
- **`sorted(container)`** to return a new list from a container, with the items sorted
- **`sum(container)`** to compute the sum of a container of numbers
- **`min(container)`** to determine the smallest item in a container
- **`max(container)`** to determine the largest item in a container
- **`abs(number)`** to determine the absolute value of a number
- **`repr(obj)`** to return a string representation of an object

> Complete list of built-in functions: https://docs.python.org/3/library/functions.html
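
Here is a small sketch of the built-in functions listed above, applied to a made-up list of scores:

```python
scores = [77, 80, 32, 100, 69]     # hypothetical data

print(type(scores))                # <class 'list'>
print(len(scores))                 # 5
print(sorted(scores))              # [32, 69, 77, 80, 100]
print(sum(scores))                 # 358
print(min(scores), max(scores))    # 32 100
print(abs(-3.5))                   # 3.5
print(repr('cat'))                 # 'cat'
```

Note that `sorted` returns a new list and leaves `scores` unchanged.
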


In [11]:
# Try out a few of the functions! See if you can predict the output for each one. 

type(5)

int

## Importing modules

- Modules are pre-packaged groups of Python files that you can import
- After importing a module, you can use its functions without having to write them yourself
- Here is a simple example of importing and using a module called numpy, which provides support for lots of different calculations and mathematical data structures:

In [12]:
# This line does the importing!
# The format is: "import (packagename) as (name you want to use for the package in code)".
# If you leave the second part out, the package is called by its original name.
import numpy as np

# Here is an example of a numpy ndarray, a really useful data structure to understand. It is an array with any number of
# dimensions, usually used to hold numbers. Numpy provides a lot of useful mathematical tools to work on arrays.

ex_array = np.array([[1,2,3],[4,5,6]])    # Create a 2-dimensional array
print(type(ex_array))                     # Prints "<class 'numpy.ndarray'>"
print(ex_array.shape)                     # Prints "(2, 3)"
print(ex_array[0, 0], ex_array[0, 1], ex_array[1, 0])   # Prints "1 2 4"
ex_array[0,0] = 3                         # Arrays are mutable: replace the top-left item
print(ex_array)

<class 'numpy.ndarray'>
(2, 3)
1 2 4
[[3 2 3]
 [4 5 6]]

- To learn about what different functions or objects a module contains, you have to consult its documentation. For example, here is a link to the numpy documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/
- Try picking a new function from the documentation and applying it to ex_array. What kinds of interesting things can you do?

In [None]:
# Your code

# Write a rule-based text classifier
Time to write a rule-based classifier!
The function outline below uses an `if/elif/else` statement to return the predicted category of a text.

Fill in the missing `if` and `elif` statements with something sensible (there is no one right answer)!

Start with something simple; we'll build it into something more complicated later.

In [13]:
def classify_rb(text):
  text = str(text).lower() # this makes the text lower-case, so we don't have to worry about matching case

  if "doctor" in text:
    return "Medical"
  elif "athlete" in text:
    return "Sports"
  elif "computer" in text:
    return "Technology"
  else:
    return "None"

## Test your rule-based algorithm
Use the following texts to test your algorithm:
- Medical: "Through the Mayo Clinic and ASU Alliance for Health Care Faculty Summer Residency Program, six professors from the College of Health Solutions and Ira A. Fulton Schools of Engineering will spend six weeks working side by side with Mayo Clinic researchers at a Mayo Clinic site in either Rochester, Minnesota, or locally in Phoenix or Scottsdale. The teams will collaborate on research that seeks to have a direct impact on patient outcomes and experiences. This year’s cohort is tackling questions relating to Alzheimer’s disease, Type 1 diabetes, liver disease and more."
- Sports: "Arizona State football has added former Cincinnati Bengals head coach Marvin Lewis as a special advisor to Herm Edwards’ staff. “Marvin Lewis is one of the most respected minds in our game,” Edwards said in a press release Tuesday. “Whether as the winningest coach in the franchise history of the Cincinnati Bengals, or the architect of one the greatest defenses in NFL history, the 2000 Baltimore Ravens, Marvin has succeeded everywhere he has been and he has done it the right way. His passion for teaching will be an incredible benefit not only for our coaches, but also for the young men we are responsible for as students and athletes.”"
- Technology: "With the advance of new printing and manufacturing technologies, heavy, rigid robots are being replaced with lightweight, flexible new systems. As humans increasingly share workspaces with robots, these softer versions offer a higher degree of safety and increased robotic capabilities. Soft, bio-inspired robots, which mimic animals like geckos, fish and octopuses — or humans — are able to adapt to solid, granular and fluid environments. The new materials used in the development of these robots are made possible by advanced printing and manufacturing techniques. According to Hamid Marvi, an ASU robotics professor whose research investigates the physics of biological systems, these developments will improve robotic support for search and rescue, medical applications and earth and space exploration."


In [19]:
def classify_rb(text):
    text = str(text).lower()

    if "health" in text:
        return "Medical"
    elif "football" in text:
        return "Sports"
    elif "robots" in text:
        return "Technology"
    else:
        return "None"

classify_rb("Through the Mayo Clinic and ASU Alliance for Health Care Faculty Summer Residency Program, six professors from the College of Health Solutions and Ira A. Fulton Schools of Engineering will spend six weeks working side by side with Mayo Clinic researchers at a Mayo Clinic site in either Rochester, Minnesota, or locally in Phoenix or Scottsdale. The teams will collaborate on research that seeks to have a direct impact on patient outcomes and experiences. This year’s cohort is tackling questions relating to Alzheimer’s disease, Type 1 diabetes, liver disease and more.")

'Medical'

## Break your rule-based classifier!

It's time to FOOL THE RULES!

You'll be deliberately trying to break each other's rule-based classifiers by writing tricky news texts that fool your neighbor's classifier. Once your own classifier has been fooled by a tricky text, it's your job to amend the rules in your classifier to account for the new case.

### Write a text about Medical that will be misclassified

Below, write a news text about Medical that the classification function above will get wrong (i.e. fail to recognize that it is about medicine).

Hint: think of less-obvious medical-related keywords that aren't included in the rule-based system above.

Then run the cell and make sure the text is classified as something other than Medical!

In [21]:
medical_news = "the"
print("This text is classified as: {:s}\n".format(classify_rb(medical_news)))

This text is classified as: None



### Write a text about Sports that will be misclassified

In [22]:
sports = "he"
print("This text is classified as: {:s}\n".format(classify_rb(sports)))

This text is classified as: None



### Write a text about Technology that will be misclassified

In [23]:
technology = "are"
print("This text is classified as: {:s}\n".format(classify_rb(technology)))

This text is classified as: None



### Evaluation
What are your thoughts on evaluating your classifier?

In [None]:
# write your answer in comment, here:
# My classifier is not perfect: it relies on a single keyword per category, so a text that lacks that keyword,
# or that contains a keyword belonging to another category, is not classified correctly.

Modify your classify_rb function so it will classify the above examples correctly.
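
One possible direction, sketched below under several assumptions, is to keep a list of keywords per category, pick the category with the most keyword matches, and measure accuracy on a few labeled texts. The names `KEYWORDS`, `classify_keywords` and `accuracy`, the keyword lists, and the example texts are all made up for this sketch; they are not the expected solution.

```python
# Hypothetical keyword lists; extend them as you find texts that fool the rules
KEYWORDS = {
    "Medical": ["health", "clinic", "disease", "patient"],
    "Sports": ["football", "coach", "athlete"],
    "Technology": ["robot", "printing", "manufacturing"],
}

def classify_keywords(text):
    text = str(text).lower()
    # Count how many keyword occurrences each category has in the text
    counts = {cat: sum(text.count(w) for w in words) for cat, words in KEYWORDS.items()}
    best = max(counts, key=counts.get)
    # If no keyword matched at all, fall back to "None"
    return best if counts[best] > 0 else "None"

def accuracy(examples):
    # examples is a list of (text, expected_category) pairs
    correct = sum(1 for text, label in examples if classify_keywords(text) == label)
    return correct / len(examples)

examples = [
    ("The clinic treats patients with liver disease.", "Medical"),
    ("The coach praised the football team.", "Sports"),
    ("Soft robots are made with new printing techniques.", "Technology"),
]
print(accuracy(examples))   # 1.0
```

Because the matching is by substring, a keyword can also match inside an unrelated longer word, so a high accuracy on a handful of examples does not mean the rules are robust.
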

## More
**Exercise:** [The Cat and Mouse Game](http://tangra.cs.yale.edu/naclo/practice/2012A.html)

In [None]:
# Your code

# <font color='MEDIUMSEAGREEN'>(2) Loading Disaster Dataset</font>


Go to this [link](https://drive.google.com/drive/folders/11jbDZA4cZ4zt-WndeMQAgSss_LIxoHWe?usp=sharing) and download the folder called Disaster_Dataset.

## Import the data

To load the data for your program, you are going to use the pandas library.

In [67]:
# Write a piece of code that reads each line of text from each of the .txt files
# store the help tweets in help_tweets and not help tweets in nothelp_tweets
# use help_tweets[:10] to see the first 10 asking-for-help tweets
# Hint: use two lists help_tweets and nothelp_tweets. Use a for loop to store the tweets in each line of the .txt files in the corresponding list.

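import pandas as pd  # imported for later use; reading the .txt files below does not need it

help_tweets = []
nothelp_tweets = []

# Each line of a file is one tweet; the trailing newline is kept.
# The with-block closes each file once it has been read.
with open("help.txt", 'r', encoding='utf-8') as fin:
    for line in fin:
        help_tweets.append(line)
with open("nothelp.txt", 'r', encoding='utf-8') as fin:
    for line in fin:
        nothelp_tweets.append(line)

print("these are the tweets that need help", help_tweets)
print()
print("these are the tweets that don't need help", nothelp_tweets)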

these are the tweets that need help ["Haven't seen my brother in two days :/ #HurricaneSandy\n", 'Shall I go for a swim in our basement? #Sandy http://t.co/9LteRUC7\n', 'Will be working relief for hurricane Sandy for the foreseeable future. If you can help with a donation plead http://t.co/pR57f92T\n', '@benveekay please check 170k sea pine drive it the blue house second house in, bay side left thank you stay safe! #lbi #sandy #lbirecovery\n', 'My block. Reason we have #nopower #hurricanesandy http://t.co/lZroP316\n', 'Petrides hs taking donations until 8pm water, clothes, food come help those in need #HurricaneSandy #donations #si\n', '#fema battery #tunnel still closed #thoasands w/o #power #finances are #decimated. Where is the #aid to #sandy #victims\n', 'Thanks to lack of financial relief , hurricane sandy has forced me to cash in savings bonds to pay utilities. SAD! Thanks Fema\n', 'To make a donation to victims of Hurricane Sandy please do so here... http://t.co/INZa6SYE\n', 'I 

In [68]:
# Write a piece of code to find the number of tweets in each file
# (shown here for help.txt; repeat with nothelp.txt for the other file)
tweets = []
with open("help.txt", 'r', encoding='utf-8') as fin:
    for line in fin:
        tweets.append(line)
print(len(tweets))

221
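
Since both files are read the same way, the loading step could be wrapped in a small helper. The sketch below uses a hypothetical function `load_tweets` that also strips the trailing newline and skips blank lines (a choice made for this sketch, not something the exercise requires). It runs on a temporary demo file so that it works without the dataset.

```python
import os
import tempfile

def load_tweets(path):
    # Hypothetical helper: one tweet per line, trailing newline removed, blank lines skipped
    with open(path, 'r', encoding='utf-8') as fin:
        return [line.rstrip('\n') for line in fin if line.strip()]

# Demonstration on a small temporary file, so the sketch runs without the dataset
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")
    with open(demo_path, 'w', encoding='utf-8') as fout:
        fout.write("first tweet\nsecond tweet\n\nthird tweet\n")
    demo_tweets = load_tweets(demo_path)

print(len(demo_tweets))    # 3
print(demo_tweets[:2])     # ['first tweet', 'second tweet']
```

With the real dataset, the same helper could be called as `load_tweets("help.txt")` and `load_tweets("nothelp.txt")`.
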


## Quiz

In [81]:
# Find all the tweets in which the word "weekend" appears and write them to a file
# Hint: use the following two lines to print to a file
# fout = open('weekend_tweets.txt', 'w')
# fout.write(some_text)
# How many did you find?
weekend = []
# The with-block closes (and flushes) the output file when writing is done
with open('weekend_tweets.txt', 'w', encoding='utf-8') as fout:
    for line in help_tweets + nothelp_tweets:
        if "weekend" in line:
            fout.write(line)
            weekend.append(line)

# The file is closed at this point, so it is safe to read it back
with open("weekend_tweets.txt", 'r', encoding='utf-8') as fin:
    for weekend_line in fin:
        print(weekend_line)




next Friday is my 21st Birthday, but Thanks to #HurricaneSandy, it ruined my weekend and #Sandy ruined the place that I was spendin it #AC

This weekend I plan to get out there and help the people and animals who were left with nothing #StatenIslandDestruction #Sandy #NYC 

Comedian @louisck, hot off SNL this weekend, to headline #StatenIsland #Sandy benefit at St. George Theater. So cool. http://t.co/nOsR70sc

Passing Lady Liberty on the ferry to Staten Island. Will be delivering aid w/@AmeriCares all weekend. #Sandy http://t.co/XVpBwRgT

@LisaDanu @macsnorky Getting better slowly. I haven’t had to get since before #Sandy but will need to by this weekend - both cars.

its really disappointing about this city's choices with the NYC Marathon this weekend.... #sandy #ny1 #statenisland #theworstofstatenisland

@gregpomes @antderosa they recently restored weekend service thanks to @NMalliotakis but not sure it runs due to #sandy #newdorpbeach

@RWtish Hope things get better!!!! Ran outside