# UFOs!!!

The objective of this homework is to practice cleaning and transforming data. To successfully complete this homework, you may use any resources available to you. 

Get the `scrubbed.csv` data [here](https://www.kaggle.com/NUFORC/ufo-sightings/data). Develop **three** interesting insights into the UFO phenomenon. 
1. When and where do people see UFOs in California? 
2. Are there differences in the circumstances of UFO sightings across the U.S. states? (Explore the comment column)
3. What is the average length of UFO sightings across the U.S. states?

Hints:
* When you explore the comment column, think about the work we did on Twitter.
* For better performance, download the data to your local drive.
* Make reasonable assumptions when transforming the length data.
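
One way to act on the last hint is to convert the free-text `duration (hours/min)` values into seconds. The sketch below is only an illustration: `parse_duration`, `DURATION_RE`, and `UNIT_SECONDS` are hypothetical names, the regular expression covers just a few phrasings (such as "45 minutes", "1-2 hrs", "1/2 hour"), and taking the midpoint of a range is an assumption rather than something the dataset prescribes.

```python
import re

# Seconds per unit, keyed by the first letter of the matched unit word.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# A number, optionally followed by "-N" (range) or "/N" (fraction), then a unit.
DURATION_RE = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*([-/])\s*(\d+(?:\.\d+)?))?\s*(sec|min|hr|hour)",
    re.IGNORECASE,
)

def parse_duration(text):
    """Return an estimated duration in seconds, or None if the text is not understood."""
    match = DURATION_RE.search(str(text))
    if match is None:
        return None
    first, sep, second, unit = match.groups()
    value = float(first)
    if sep == "/" and float(second) != 0:
        value = value / float(second)        # "1/2 hour" -> 0.5
    elif sep == "-":
        value = (value + float(second)) / 2  # assumption: midpoint of the range
    return value * UNIT_SECONDS[unit[0].lower()]

for raw in ["45 minutes", "1-2 hrs", "1/2 hour", "20 seconds", "a while"]:
    print(raw, "->", parse_duration(raw))
```

Values the pattern does not recognize come back as `None`, so they can be counted and inspected instead of being silently guessed.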

In [93]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [94]:
import pandas as pd
import numpy as np

In [127]:
data = pd.read_csv('/Users/celiaguan/Downloads/scrubbed.csv', low_memory=False, index_col=0)

In [306]:
data.head()

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/49 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/04,29.8830556,-97.941111
10/10/49 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/05,29.38421,-98.581082
10/10/55 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/08,53.2,-2.916667
10/10/56 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/04,28.9783333,-96.645833
10/10/60 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/04,21.4180556,-157.803611


### 1. When and where do people see UFOs in California?

In [190]:
# time (the index) and city of each UFO sighting in California
ufo_ca = data[data['state']=='ca']['city']
ufo_ca.head()

datetime
10/10/68 13:00    hawthorne
10/10/79 22:00    san diego
10/10/89 0:00     calabasas
10/10/95 22:40      oakland
10/10/98 2:30     hollywood
Name: city, dtype: object

In [310]:
# number of sightings reported for each (datetime, city) pair in California
data[data['state']=='ca'].groupby(['datetime','city'])['city'].count().head()

datetime      city         
1/1/00 0:03   los angeles      1
1/1/00 0:23   san diego        1
1/1/00 16:20  san francisco    1
1/1/00 5:00   clovis           1
1/1/01 0:02   woodland         1
Name: city, dtype: int64
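
Grouping on the raw timestamp string gives almost one group per sighting, so it says little about when sightings happen. A minimal sketch of a coarser summary follows, assuming the timestamps look like the ones printed above (`month/day/year hour:minute`). The `sample` frame is toy data copied from the outputs above, with `datetime` as an ordinary column (in the notebook it is the index, so `data.reset_index()` would give the same shape). The hour is pulled out of the text with a regular expression, which avoids guessing the century of two-digit years.

```python
import pandas as pd

# Toy rows in the same shape as the California subset shown above.
sample = pd.DataFrame(
    {
        "datetime": ["10/10/68 13:00", "10/10/79 22:00", "10/10/89 0:00",
                     "10/10/95 22:40", "10/10/98 2:30", "1/1/00 0:23"],
        "city": ["hawthorne", "san diego", "calabasas",
                 "oakland", "hollywood", "san diego"],
        "state": ["ca"] * 6,
    }
)

ca = sample[sample["state"] == "ca"].copy()

# Extract the hour of day (the digits between the space and the colon).
ca["hour"] = ca["datetime"].str.extract(r"\s(\d{1,2}):", expand=False).astype(int)

# When: number of sightings per hour of day.
by_hour = ca.groupby("hour").size()

# Where: number of sightings per city, most frequent first.
by_city = ca["city"].value_counts()

print(by_hour)
print(by_city.head())
```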

In [92]:
#data[(data['country']=='us') & (data['state']=='ca')]['city']

In [204]:
#data[data['country'].isnull()][['city','state','country']]

### 2. Are there differences in the circumstances of UFO sightings across the U.S. states?

In [104]:
# list of US states
us_states=data[data['country']=='us']['state'].unique()

In [211]:
# aggregate all comments in a state to a list
comments = data[data['state'].isin(us_states)][['state', 'comments']].groupby('state').agg(lambda x: x.tolist())
comments.head()

state,comments
ak,[It looked like a star but it would move and a...
al,[Strobe Lighted disk shape object observed clo...
ar,[Round&#44 bright&#44 low flying object silent...
az,[A small dark purple quad-thruster craft hover...
ca,[ROUND &#44 ORANGE &#44 WITH WHAT I WOULD SAY ...


In [212]:
# join each state's list of comments into one string
comments['comments'] = comments['comments'].apply(lambda x: ' '.join(str(s) for s in x))
comments.head()

state,comments
ak,It looked like a star but it would move and af...
al,Strobe Lighted disk shape object observed clos...
ar,Round&#44 bright&#44 low flying object silentl...
az,A small dark purple quad-thruster craft hoveri...
ca,ROUND &#44 ORANGE &#44 WITH WHAT I WOULD SAY W...


In [292]:
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [293]:
stop = set(stopwords.words('english'))

In [294]:
# tokenize comments into single lowercase words
def tokenize(text):
    try:
        # match punctuation, digits, and carriage return / tab / newline characters
        punc_num = re.compile('['+re.escape(string.punctuation)+'0-9\\r\\t\\n]')
        # replace each match with a space
        text = punc_num.sub(' ', text)

        # tokenize words
        word = word_tokenize(text)
        # filter out stop words
        word = list(filter(lambda x: x.lower() not in stop, word))
        # keep only words longer than two characters, lowercased
        word = [w.lower() for w in word if len(w)>2]

        return word
    except TypeError as e:
        # non-string input: report it and implicitly return None
        print(text, e)

In [295]:
cm = comments.copy()

In [296]:
cm['comments']=cm['comments'].apply(tokenize)
cm.head()

state,comments
ak,"[looked, like, star, would, move, seconds, red..."
al,"[strobe, lighted, disk, shape, object, observe..."
ar,"[round, bright, low, flying, object, silently,..."
az,"[small, dark, purple, quad, thruster, craft, h..."
ca,"[round, orange, would, say, polished, metal, k..."
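
The raw comments contain HTML character references such as `&#44` (a comma). Because `tokenize` strips punctuation and digits, numeric references disappear, but a named reference like `&quot;` would survive as the word `quot`. A minimal sketch of decoding them before tokenizing, using the standard library's `html.unescape`; `clean_entities` is a hypothetical helper name and the sample string is made up from fragments shown above.

```python
import html

def clean_entities(text):
    """Decode HTML character references (e.g. &#44, &quot;) back to plain characters."""
    return html.unescape(str(text))

raw = "Round&#44 bright&#44 low flying object &quot;grew&quot;"
cleaned = clean_entities(raw)
print(cleaned)
```

Applied to the notebook's data, this could be run on the `comments` column before `tokenize`, for example with `comments['comments'].apply(clean_entities)`.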


In [297]:
# score each word with VADER and compute the percentage of strongly positive (or negative) words among all words
sid = SentimentIntensityAnalyzer()
def pos(text):
    positive = 0
    try:
        for word in text:
            if (sid.polarity_scores(word)['compound']) >= 0.5:
                positive +=1
        return positive/len(text)*100
    except TypeError as e: print(text,e)          

In [298]:
def neg(text):
    negative = 0
    try:
        for word in text:
            if (sid.polarity_scores(word)['compound']) <= -0.5:
                negative +=1
        return negative/len(text)*100
    except TypeError as e: print(text,e)

In [299]:
cm['positive %']=cm['comments'].apply(pos)
cm['negative %']=cm['comments'].apply(neg)
cm['sentiment ratio']=cm['positive %']/cm['negative %']
cm.head()

state,comments,positive %,negative %,sentiment ratio
ak,"[looked, like, star, would, move, seconds, red...",0.130378,0.065189,2.0
al,"[strobe, lighted, disk, shape, object, observe...",0.264988,0.115932,2.285714
ar,"[round, bright, low, flying, object, silently,...",0.201918,0.084133,2.4
az,"[small, dark, purple, quad, thruster, craft, h...",0.272236,0.059552,4.571429
ca,"[round, orange, would, say, polished, metal, k...",0.326383,0.065048,5.017544


In [312]:
cm.sort_values('positive %', ascending=False).head(5)

state,comments,positive %,negative %,sentiment ratio
mt,"[floating, light, like, spotlight, friend, out...",0.64906,0.15667,4.142857
mn,"[traveling, northbound, state, highway, approx...",0.555971,0.064151,8.666667
de,"[moon, sized, object, quot, grew, quot, fill, ...",0.449006,0.0,inf
ne,"[object, moving, erratically, sky, stopped, sp...",0.444313,0.059242,7.5
wv,"[solid, round, silver, ball, passing, msl, sma...",0.443511,0.023343,19.0


### The five states with the highest share of positive words are MONTANA, MINNESOTA, DELAWARE, NEBRASKA, and WEST VIRGINIA.

In [313]:
cm.sort_values('negative %', ascending=False).head(5)

state,comments,positive %,negative %,sentiment ratio
ri,"[bright, oval, object, sky, watching, sky, lik...",0.198098,0.198098,1.0
ky,"[slow, moving, silent, craft, accelerated, unb...",0.246433,0.168612,1.461538
nd,"[missing, time, obj, lights, triangular, shape...",0.246914,0.164609,1.5
mt,"[floating, light, like, spotlight, friend, out...",0.64906,0.15667,4.142857
md,"[freinds, familes, see, ufo, bright, light, mo...",0.387742,0.150094,2.583333


### The five states with the highest share of negative words are RHODE ISLAND, KENTUCKY, NORTH DAKOTA, MONTANA, and MARYLAND. MONTANA appears in both lists, which suggests that opinions about UFOs there are more polarized than in other states.

In [315]:
cm.loc['mt']

comments           [floating, light, like, spotlight, friend, out...
positive %                                                   0.64906
negative %                                                   0.15667
sentiment ratio                                              4.14286
Name: mt, dtype: object

### Although MONTANA ranks high on both lists, its share of positive words is still about 4 times its share of negative words.

In [305]:
cm.sort_values('sentiment ratio', ascending=False).head()

state,comments,positive %,negative %,sentiment ratio
wy,"[stationary, object, horizon, flashing, red, g...",0.328407,0.0,inf
de,"[moon, sized, object, quot, grew, quot, fill, ...",0.449006,0.0,inf
hi,"[marine, flying, fighter, attack, aircraft, so...",0.374065,0.0,inf
pr,"[went, saw, cigar, shaped, object, high, sun, ...",0.271739,0.0,inf
wv,"[solid, round, silver, ball, passing, msl, sma...",0.443511,0.023343,19.0


### Four entries (WYOMING, DELAWARE, HAWAII, and PUERTO RICO) have no strongly negative words at all, so their ratio is infinite. Among the rest, WEST VIRGINIA has the highest ratio, with 19 times as many positive words as negative ones.
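
Dividing by a negative share of zero produces `inf`, which pushes those rows to the top of any sort on the ratio. The sketch below shows two possible ways to handle this, using made-up word counts (the `toy` frame and its numbers are hypothetical, not taken from the data): add-one smoothing on the counts, or marking the ratio as undefined (`NaN`) when there are no negative words.

```python
import numpy as np
import pandas as pd

# Hypothetical positive / negative word counts per state.
toy = pd.DataFrame(
    {"positive": [19, 5, 4], "negative": [1, 0, 2]},
    index=["wv", "wy", "ak"],
)

# Option 1: add-one smoothing keeps every ratio finite.
toy["smoothed ratio"] = (toy["positive"] + 1) / (toy["negative"] + 1)

# Option 2: treat the ratio as undefined when there are no negative words.
toy["ratio"] = toy["positive"] / toy["negative"].replace(0, np.nan)

print(toy.sort_values("smoothed ratio", ascending=False))
```

With either option, states with very few negative words no longer dominate the ranking purely because of a zero in the denominator.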

In [303]:
cm.loc['ca']

comments           [round, orange, would, say, polished, metal, k...
positive %                                                  0.326383
negative %                                                 0.0650484
sentiment ratio                                              5.01754
Name: ca, dtype: object

### Comments from CALIFORNIA are generally positive: positive words outnumber negative ones by about 5 to 1.

### 3. What is the average length of UFO sightings across the U.S. states?

In [304]:
# find all records that have a US state but no country value
#data[(data['country'].isnull()) & (data['state'].isin(us_states))][['state','country']]

In [203]:
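# copy the slice so the conversion below does not trigger SettingWithCopyWarning
a = data[data['state'].isin(us_states)][['duration (seconds)','state']].copy()
# non-numeric durations become NaN and are ignored by mean()
a['duration (seconds)'] = pd.to_numeric(a['duration (seconds)'], errors='coerce')
a.groupby('state').mean()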

state,duration (seconds)
ak,4231.830508
al,1393.408828
ar,100867.138889
az,5949.009338
ca,3928.781072
co,3024.394751
ct,13089.214928
dc,1161.224545
de,868.904372
fl,13504.459262
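
Some of these means are very large (the value for `ar` is about 28 hours), which suggests that a few extreme durations dominate the average. A minimal sketch of reporting the median and the count next to the mean, on made-up rows (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical durations; one extreme value in "ar" and one unparseable value in "ca".
toy = pd.DataFrame(
    {
        "state": ["ar", "ar", "ar", "ca", "ca", "ca", "ca"],
        "duration (seconds)": ["60", "300", "1000000", "120", "600", "900", "unknown"],
    }
)

# Non-numeric entries become NaN and are skipped by mean, median, and count.
toy["duration (seconds)"] = pd.to_numeric(toy["duration (seconds)"], errors="coerce")

summary = toy.groupby("state")["duration (seconds)"].agg(["mean", "median", "count"])
print(summary)
```

In this toy example the mean for `ar` is more than a thousand times its median, while the two statistics are close for `ca`; a gap like that is a sign that the mean alone is not a reliable "average length".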




