In [None]:
%%capture
!python -m spacy download en

In [None]:
from datascience import *
import spacy
from collections import Counter
import folium
import numpy as np
from IPython.display import HTML, display, IFrame
from scripts.hist_module import *
import requests
import time
import urllib.parse
import matplotlib.pyplot as plt
%matplotlib inline 
plt.style.use("fivethirtyeight")

# Introduction to Importing Data, Using Tables and Creating Graphs 

## The Jupyter Notebook

First of all, note that this page is divided into what are called "cells". For example, the following cell is a "code cell" where you will write your code. You'll see an `In [ ]:` next to each code cell, which is a counter for the cells you have run. You can navigate between cells by clicking on them or by using the up and down arrows. Cells are highlighted as you navigate them.

In [None]:
# this is a code cell

### Executing cells

<p></p>

<div class="alert alert-info">
You can execute cells with <b><code>Ctrl-Enter</code></b> (which will run the cell and keep the same cell selected), or <b><code>Shift-Enter</code></b> (which will run the cell and then select the next cell).
</div>

Try running the following cell and see what it prints out:

In [None]:
print("Hello world!")

## Creating Tables

### From Scratch

If we don't have a spreadsheet file and are starting with nothing, first we need to make arrays. An array is an ordered collection of items, much like a list. In the case of a table, we'll consider an array as either a row or a column. Let's make two arrays below that will become our columns, one for famous psychologists and one for the year they were born.

In general, to make an array we use: 

```python
make_array(attribute_1, attribute_2, ...)
```

We set each of these created arrays equal to a variable name. This means that from now on, we can use that variable name to reference its respective array! Variables make information storage and retrieval much easier. 

In [None]:
psychologist_names = make_array("Freud", "Skinner", "Piaget", "Maslow")
psychologist_birth = make_array(1856, 1904, 1896, 1908)

Since we've assigned these to variables, all we have to do is call the variable name to get the information back, or to manipulate it!

In [None]:
psychologist_names
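
To see what "manipulating" an array can look like, here is a minimal sketch. `make_array` returns a NumPy array, so arithmetic applies to every element at once. The sketch builds the same values with NumPy directly (an assumption made so it runs without the `datascience` package), and the names `turned_fifty` and `birth_order` are only illustrative.

```python
import numpy as np

# Same values as above, built with NumPy directly instead of make_array.
psychologist_names = np.array(["Freud", "Skinner", "Piaget", "Maslow"])
psychologist_birth = np.array([1856, 1904, 1896, 1908])

# Arithmetic applies element by element: the year each person turned 50.
turned_fifty = psychologist_birth + 50

# argsort gives the positions that would sort the birth years,
# which we use to reorder the names from earliest to latest born.
birth_order = psychologist_names[np.argsort(psychologist_birth)]

print(turned_fifty)
print(birth_order)
```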

Now, to make a table using these arrays, we use the general form:

```python
Table().with_columns("Column Name", array_name, ...)
```

We assign the created table to a variable (just like the arrays from above), and then call `show()` on that variable to display the table.

In [None]:
psych_table = Table().with_columns("Psychologist", psychologist_names,
                                   "Birth Year", psychologist_birth)
psych_table.show()

### Importing

It's more likely that a file holding your data already exists. In general, to import data from a file, we write:

```python
Table.read_table("file_name")
```

Most often, these file names end in `.csv` to show the data format. The `.csv` (comma-separated values) format is popular for spreadsheets and can be imported into and exported from programs such as Microsoft Excel, OpenOffice Calc, or Google Sheets.
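
If you have never opened a `.csv` file as text, the sketch below may help. It uses Python's built-in `csv` module on a tiny in-memory example (two rows borrowed from the psychologist table above) rather than a real file, so the variable names here are only for illustration.

```python
import csv
import io

# A tiny CSV document: the first line holds the column names, and each
# following line is one row with its values separated by commas.
csv_text = """Psychologist,Birth Year
Freud,1856
Skinner,1904
"""

# io.StringIO lets us treat the string like an open file.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["Psychologist"], rows[0]["Birth Year"])
```

Note that the `csv` module reads every value as text, so `"1856"` comes back as a string here rather than a number.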

We've scraped some data on [job postings](http://careers.historians.org/jobs/?page=1) for historians from the American Historical Association:

In [None]:
job_data = Table.read_table('data/AHA-jobs.csv')

Let's `show` the first 5 rows:

In [None]:
job_data.show(5)

Python can tell us how large this table is through two attributes: `num_rows` and `num_columns`. The general forms are `table.num_rows` and `table.num_columns`. Because these are attributes rather than functions, they take no parentheses.

Let's use these on the table above. 

In [None]:
job_data.num_rows

In [None]:
job_data.num_columns

It looks like we have 144 job postings, and 14 different columns of metadata for each posting.

There are two ways to subset a table by its columns: the `select` method and the `drop` method.

- `select` can create a new table with only the columns indicated in the parameters 
- `drop` can create a new table with columns NOT indicated in the parameters

In [None]:
job_data.select('title')

In [None]:
job_data.select('title', 'employer')

If we want to select only a specific subset of rows, we can use the `where` method. The general form of this function is:

```python
table_name.where(column_name, predicate)
```

Let's look only at job postings `where` the `primary_field` is `equal_to` "United States/North America":

In [None]:
job_data.where("primary_field", are.equal_to('United States/North America'))

---

### Tables Essentials!

For your reference, here's a table of useful `Table` functions:

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|
|`join`|`tbl1.join("shared_column_name", tbl2)`|Join together two tables with a common column name|
|`are.equal_to()`|`tbl.where("SEX", are.equal_to(0))`|find values equal to that indicated|
|`are.not_equal_to()`|`tbl.where("SEX", are.not_equal_to(0))` | find values not equal to that indicated|
|`are.above()`| `tbl.where("AGE", are.above(30))` | find values greater than that indicated|
|`are.below()`| `tbl.where("AGE", are.below(40))` | find values less than that indicated |
|`are.between()`| `tbl.where("AGE", are.between(18, 60))` | find values between the two indicated |

---

## Visualizations 

Now that we have a manageable table we can start making visualizations! When a table has two numerical columns, a scatter plot is a natural first choice.

To create a scatter plot, we need to use the `scatter()` function. The general form is:

```python
table.scatter("column for x axis", "column for y axis")
```

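The columns we explore below are categorical, so instead of a scatter plot the following cells use `group` to count the postings in each category and `barh` to draw those counts as a horizontal bar chart: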

In [None]:
job_data.group('primary_field')

In [None]:
job_data.group('primary_field').sort('count', descending=True)
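
If it's unclear what `group` is doing, the sketch below reproduces the same kind of tally in plain Python with `Counter`. The list of fields is a made-up sample: only "United States/North America" is taken from the data above, and the other values are hypothetical.

```python
from collections import Counter

# A hypothetical sample of values from a column like primary_field.
fields = [
    "United States/North America",
    "Europe",
    "United States/North America",
    "Asia",
    "Europe",
    "United States/North America",
]

# Counter tallies how many times each distinct value appears,
# which is the count that group reports for each category.
counts = Counter(fields)

# most_common() lists the categories from most to least frequent,
# similar to sorting the grouped table by 'count' in descending order.
for field, count in counts.most_common():
    print(field, count)
```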

In [None]:
job_data.group('primary_field').barh('primary_field')

In [None]:
job_data.group('employer').barh('employer')

In [None]:
job_data.group('type').barh('type')

In [None]:
job_data.group('employment_type').barh('employment_type')

In [None]:
chi_square_of_df_cols(job_data.to_df(), 'employment_type', 'primary_field')
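
`chi_square_of_df_cols` comes from this course's `scripts.hist_module`, and its internals are not shown in this notebook. Assuming it runs a chi-square test of independence between two categorical columns, the statistic behind such a test can be sketched in plain Python as follows. The function name `chi_square_statistic` and the table of counts are hypothetical, and the sketch applies no continuity correction and computes no p-value.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count we would expect in this cell if the two variables
            # were independent of each other.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are two employment types, columns two fields.
observed = [[10, 20],
            [20, 10]]
statistic, dof = chi_square_statistic(observed)
print(round(statistic, 3), dof)
```

A larger statistic (for a given number of degrees of freedom) means the observed counts sit further from what independence would predict.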

---

## Text Analysis

In [None]:
nlp = spacy.load('en')

In [None]:
job_data['job_description'][0]

In [None]:
parsed_text = nlp(str(job_data['job_description'][0]))

In [None]:
len(parsed_text)

In [None]:
len(list(parsed_text.sents))

In [None]:
sents_tab = Table()
sents_tab.append_column(label="Sentence", values=[sentence.text for sentence in parsed_text.sents])
sents_tab.show()

In [None]:
toks_tab = Table()
toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
toks_tab.show()

In [None]:
toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
toks_tab.show()

In [None]:
toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
toks_tab.show()

In [None]:
def tablefy(parsed_text):
    toks_tab = Table()
    toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
    toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
    toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
    toks_tab.append_column(label="Stop Word", values=[word.is_stop for word in parsed_text])
    toks_tab.append_column(label="Punctuation", values=[word.is_punct for word in parsed_text])
    toks_tab.append_column(label="Space", values=[word.is_space for word in parsed_text])
    toks_tab.append_column(label="Number", values=[word.like_num for word in parsed_text])
    toks_tab.append_column(label="OOV", values=[word.is_oov for word in parsed_text])
    toks_tab.append_column(label="Dependency", values=[word.dep_ for word in parsed_text])
    return toks_tab

In [None]:
tablefy(parsed_text).show()

In [None]:
all_descriptions = tablefy(nlp(' '.join(job_data['job_description'])))
all_descriptions.show(100)

In [None]:
adjectives = all_descriptions.where('POS', are.equal_to('ADJ'))
adjectives.show(10)

In [None]:
Counter(adjectives['Word']).most_common()

In [None]:
adjectives = all_descriptions.where('POS', are.equal_to('ADJ')).where('Stop Word', are.equal_to(False))
Counter(adjectives['Word']).most_common()

In [None]:
Counter(adjectives['Word'])['digital']

In [None]:
nouns = all_descriptions.where('POS', are.equal_to('NOUN')).where('Stop Word', are.equal_to(False))
Counter(nouns['Word']).most_common()

## N-grams
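
This section is left empty in the notebook. As a placeholder sketch only: an n-gram is a run of `n` consecutive tokens, and counting n-grams works much like counting single words above. The helper `ngrams` and the toy sentence below are hypothetical, and plain `split()` stands in for spaCy's tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the token list, each shifted one step further.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the history of the history department".split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))
```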

### Challenge

Repeat the word counts above for verbs (rows whose `POS` is `VERB`).

## Mapping

In [None]:
latitude = []
longitude = []
for i in range(job_data.num_rows):
    search = urllib.parse.quote(job_data['employer'][i])
    
    print(job_data['employer'][i])

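    try:
        # Note: the Google Geocoding API generally expects an API key (a `key`
        # query parameter); without one, expect no results and a blank row.
        json_res = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address={}'.format(search)).json()
        coordinates = json_res['results'][0]['geometry']['location']
        lat, lng = coordinates['lat'], coordinates['lng']
    except Exception:
        # Lookup failed (network error or no match): record blanks.
        lat, lng = '', ''

    # Append exactly once per row so both lists stay aligned with the table.
    latitude.append(lat)
    longitude.append(lng)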

    time.sleep(.5)
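
The loop above passes each employer name through `urllib.parse.quote` before placing it in the URL. The sketch below shows what that step does, using an illustrative employer name that is not taken from the data. It also shows `urlencode`, an alternative from the same module that builds a whole query string from a dictionary.

```python
from urllib.parse import quote, urlencode

# An illustrative employer name containing spaces and a comma.
employer = "University of California, Berkeley"

# quote() replaces characters that are not safe in a URL with %XX codes.
escaped = quote(employer)
print(escaped)

# urlencode() escapes the values and joins "name=value" pairs for us.
# It writes spaces as "+" instead of "%20", which is also valid in a query.
query = urlencode({"address": employer})
url = "https://maps.googleapis.com/maps/api/geocode/json?" + query
print(url)
```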

In [None]:
job_data = job_data.with_columns('latitude', latitude, 'longitude', longitude)
job_data.show(5)

In [None]:
color_dict, html_key = assign_colors(job_data.to_df(), "employment_type")
display(HTML(html_key))

In [None]:
# Folium is a useful library for generating Google Maps-like map visualizations.
mapa = folium.Map(location=[39.8333333,-98.585522], zoom_start=3)
for r in job_data.rows:
    
    if r[-2] != '':
        folium.CircleMarker((float(r[-2]), float(r[-1])),
                    radius=1,
                    popup=r[10],
                    color=color_dict[r[10]],
                    fill_color=color_dict[r[10]],
                   ).add_to(mapa)

mapa.save("map1.html")
IFrame('map1.html', width=700, height=400)