Part 2: Parse + Acquire

ACQUIRE: Obtain the data
Ideal Data vs. Available Data

We often start by identifying the ideal data we would want for a project.

During the data acquisition phase, we'll learn about what data is available and any limitations it may have. We'll decide if these limitations will inhibit our ability to answer our question or if we can work with what we have to find a reasonable and reliable answer.

Some typical questions at this stage may include:

- Identifying the right data set(s)
- Is there enough data?
- Does it appropriately align with the question/problem statement?
- Can the dataset be trusted? How was it collected?
- Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?
- Assess resources, requirements, assumptions, and constraints

Further, we'll need to acquire the data by:

- Importing data from the web (Google Analytics, HTML, XML)
- Importing data from a file (CSV, XML, TXT, JSON)
- Importing data from a preexisting database (SQL)
- Setting up a local or remote data structure
- Determining the most appropriate tools to work with the data (depending on its format and size)
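
As a rough illustration of the import routes listed above, the following sketch loads the same kind of table from CSV text, JSON text, and a SQL database using pandas. The sample values, the `venues` table name, and the in-memory SQLite database are made up for illustration; real projects would point at actual files, URLs, or database connections.

```python
import io
import json
import sqlite3

import pandas as pd

# Made-up sample data standing in for real files and tables.
csv_text = "venue,rating\nCafe A,8.1\nCafe B,6.4\n"
json_text = json.dumps([{"venue": "Cafe A", "tips": 12},
                        {"venue": "Cafe B", "tips": 3}])

# From a CSV file (a file path or URL can replace the StringIO buffer).
from_csv = pd.read_csv(io.StringIO(csv_text))

# From a JSON file holding a list of records.
from_json = pd.read_json(io.StringIO(json_text))

# From a SQL database; an in-memory SQLite database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("venues", conn, index=False)
from_sql = pd.read_sql("SELECT venue, rating FROM venues WHERE rating > 7", conn)
conn.close()

print(from_csv.shape, from_json.shape, from_sql.shape)
```

Whichever route is used, the result is a DataFrame, so the later parsing steps do not depend on where the data came from.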


PARSE: Understand the data
Many times we are given secondary data, or data that was previously collected. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine exactly how this data was gathered.

Check: Why might it be important to understand how data was collected?

A data dictionary is exactly what it sounds like - it's a set of documentation that explains what our data is and how it is formatted. Here is an example:

Variable | Description | Type of Variable
---| ---| ---
Profession | Title of the account owner | Categorical
Company Size | 1- small, 2- medium, 3- large| Categorical
Location | Planet of the company | Categorical
Days Since Last Delivery | Integer | Continuous
Number of Deliveries | Integer | Continuous
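
One way to put a data dictionary to work is to check a loaded table against it. The sketch below is a minimal example, assuming the dictionary is transcribed by hand into a Python dict; the `check_against_dictionary` helper and the sample rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical transcription of the data dictionary above:
# column name -> (expected kind, allowed values or None).
data_dictionary = {
    "Profession": ("categorical", None),
    "Company Size": ("categorical", {1, 2, 3}),
    "Location": ("categorical", None),
    "Days Since Last Delivery": ("numeric", None),
    "Number of Deliveries": ("numeric", None),
}

def check_against_dictionary(df, dictionary):
    """Return a list of problems; an empty list means no mismatch was found."""
    problems = []
    for column, (kind, allowed) in dictionary.items():
        if column not in df.columns:
            problems.append("missing column: " + column)
            continue
        if kind == "numeric" and not pd.api.types.is_numeric_dtype(df[column]):
            problems.append("not numeric: " + column)
        if allowed is not None:
            seen = set(df[column].dropna().unique().tolist())
            unexpected = seen - allowed
            if unexpected:
                problems.append("unexpected values in {}: {}".format(
                    column, sorted(unexpected)))
    return problems

# Made-up rows for illustration only; the company size of 4 is deliberately invalid.
sample = pd.DataFrame({
    "Profession": ["Captain", "Delivery Boy"],
    "Company Size": [1, 4],
    "Location": ["Earth", "Mars"],
    "Days Since Last Delivery": [3, 10],
    "Number of Deliveries": [25, 7],
})
problems = check_against_dictionary(sample, data_dictionary)
print(problems)
```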

Common Tasks at this step include:

- Reading any documentation provided with the data (e.g. data dictionary above)
- Performing exploratory surface analysis via filtering, sorting, and simple visualizations
- Describing data structure and the information being collected
- Exploring variables, data types via select
- Assessing preliminary outliers, trends
- Verifying the quality of the data (feedback loop -> 1)
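
A first pass over these tasks might look like the sketch below. The column names and values are made up, and the 1.5 × IQR rule is just one common heuristic for flagging preliminary outliers, not a method prescribed by the course material.

```python
import pandas as pd

# Made-up delivery records for illustration only.
df = pd.DataFrame({
    "company_size": [1, 2, 2, 3, 3, 1],
    "days_since_last_delivery": [2, 5, 4, 3, 60, 6],
    "number_of_deliveries": [10, 40, 35, 80, 2, 12],
})

# Structure: column types and missing-value counts.
print(df.dtypes)
print(df.isnull().sum())

# Sorting and filtering to eyeball the extremes.
print(df.sort_values("days_since_last_delivery", ascending=False).head(3))
large = df[df["company_size"] == 3]

# Preliminary outlier flag using the 1.5 * IQR rule.
col = df["days_since_last_delivery"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(outliers)
```

Rows flagged this way are candidates for a closer look, not automatic deletions; whether they are errors depends on how the data was collected.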

### **[Capstone, Part 2: Dataset + Data Collection](./part-02/readme.md)**

Use your newfound skills to source and collect the relevant data for your project. Data acquisition, transformation, and cleaning are typically the most time-consuming parts of data science projects, so don’t procrastinate!

- **Requirements**: Source and format the data for your project. Perform preliminary data munging and cleaning of the data relevant to your project goals. Describe your data, keeping the intended audience of your final report in mind.
- **Format:** Table, file, or database with relevant text file or notebook description.
- **Due:** End of week 8


In [None]:
Data for modelling 
Data for Prediction
Data Dictionary of both data sets

# Data for Model
## Extracting Training data from ABSA xml
- I was able to find labeled training data for a sentiment evaluation of restaurant reviews from meta-share, a language data resource.

http://metashare.ilsp.gr:8080/repository/browse/semeval-2015-absa-restaurant-reviews-train-data/b2ac9c0c198511e4a109842b2b6a04d751e6725f2ab847df88b19ea22cb5cc4a/

In [9]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET

xml_path = './NLP/ABSA15_RestaurantsTrain2/ABSA-15_Restaurants_Train_Final.xml'

def parse_data_2015(xml_path):
    """Return one row per (sentence, opinion) pair from the ABSA 2015 XML file."""
    container = []
    reviews = ET.parse(xml_path).getroot()

    for review in reviews:
        # The first child of each review holds its sentences.
        # Element.getchildren() was removed in Python 3.9, so elements are
        # indexed and iterated directly instead.
        sentences = review[0]
        for sentence in sentences:
            # The first child of a sentence is its text.
            sentence_text = sentence[0].text

            try:
                # The second child, when present, holds the opinions.
                opinions = sentence[1]
            except IndexError:
                # No opinions annotated: keep the sentence without a sentiment label.
                container.append({"sentence": sentence_text})
                continue

            for opinion in opinions:
                container.append({"sentence": sentence_text,
                                  "sentiment": opinion.attrib["polarity"]})

    return pd.DataFrame(container)

ABSA_df = parse_data_2015(xml_path)
ABSA_df.head()



| | sentence | sentiment |
| --- | --- | --- |
| 0 | Judging from previous posts this used to be a ... | negative |
| 1 | We, there were four of us, arrived at noon - t... | negative |
| 2 | They never brought us complimentary noodles, i... | negative |
| 3 | The food was lousy - too sweet or too salty an... | negative |
| 4 | The food was lousy - too sweet or too salty an... | negative |
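
The positional lookups in `parse_data_2015` depend on the order of child elements. A sketch of the same extraction by tag name is shown below, assuming the file uses `sentence`, `text`, `Opinions`, and `Opinion` elements with a `polarity` attribute; the sample XML is made up and should be checked against the real file before relying on these names. Unlike the function above, this version gives every sentence without opinions an explicit `None` sentiment.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample; tag and attribute names are assumed to match the ABSA file.
sample_xml = """
<Reviews>
  <Review rid="1">
    <sentences>
      <sentence id="1:0">
        <text>The food was great but the service was slow.</text>
        <Opinions>
          <Opinion target="food" category="FOOD#QUALITY" polarity="positive"/>
          <Opinion target="service" category="SERVICE#GENERAL" polarity="negative"/>
        </Opinions>
      </sentence>
      <sentence id="1:1">
        <text>We went on a Tuesday.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""

def parse_reviews(root):
    """Return one dict per (sentence, opinion) pair, looked up by tag name."""
    rows = []
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text")
        opinions = sentence.findall("Opinions/Opinion")
        if not opinions:
            # Sentence without annotated opinions: keep it with no label.
            rows.append({"sentence": text, "sentiment": None})
        for opinion in opinions:
            rows.append({"sentence": text, "sentiment": opinion.get("polarity")})
    return rows

rows = parse_reviews(ET.fromstring(sample_xml))
for row in rows:
    print(row)
```

A sentence with several opinions produces several rows with identical text, which is why rows 3 and 4 in the output above repeat the same sentence. Passing `rows` to `pd.DataFrame` gives the same layout as `ABSA_df`.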


# Foursquare Data

### Obtaining requests and json files for each category

In [151]:
category_name = cat.level2_name.unique()
n = len(category_name)

In [160]:
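import json
from time import sleep

import requests

url = 'https://api.foursquare.com/v2/venues/explore'
total = len(category_name)

data = []
# `category` is used as the loop variable so the `cat` DataFrame is not overwritten.
for i, category in enumerate(category_name, start=1):
    sleep(0.5)  # pause between calls to avoid hammering the API
    print('{}/{} {}'.format(i, total, category))  # progress indicator
    # Note: central London is at longitude -0.1278 (west); the positive value
    # used here centres the search further east.
    params = dict(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, v='20170801',
                  ll='51.5074,0.1278', query=category, limit=3)
    resp = requests.get(url=url, params=params)
    data.append(json.loads(resp.text))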
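
The request loop above stops at the first network error and does not record which categories failed. The sketch below shows one possible way to separate the throttling and error bookkeeping from the HTTP call itself. The `explore_params` and `collect` helpers and the `fake_fetch` stand-in are hypothetical names introduced for illustration; no real API call is made here.

```python
import time

def explore_params(query, client_id, client_secret,
                   ll="51.5074,0.1278", limit=3, version="20170801"):
    """Build the query parameters used in the request above."""
    return dict(client_id=client_id, client_secret=client_secret,
                v=version, ll=ll, query=query, limit=limit)

def collect(queries, fetch, pause=0.5):
    """Call fetch(query) for each query, pausing between calls.

    Returns (responses, failures), where failures lists (query, error message).
    """
    responses, failures = [], []
    total = len(queries)
    for i, query in enumerate(queries, start=1):
        try:
            responses.append(fetch(query))
        except Exception as exc:  # record the failure and keep going
            failures.append((query, str(exc)))
        print("{}/{} {}".format(i, total, query))
        if i < total:
            time.sleep(pause)
    return responses, failures

# Stand-in for a real HTTP call so the sketch runs offline.
def fake_fetch(query):
    if query == "Broken":
        raise ValueError("bad response")
    return {"response": {"groups": [{"items": []}]}}

responses, failures = collect(["Aquarium", "Broken", "Arcade"], fake_fetch, pause=0)
print(len(responses), failures)
```

With `requests`, a real `fetch` could call `requests.get(url, params=explore_params(query, CLIENT_ID, CLIENT_SECRET))`, then `raise_for_status()` on the response, and return its `.json()`. Failed categories can then be retried from the `failures` list.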

In [172]:
len(data)

424

### List of items from each category

In [192]:
results = []
# Collect every venue item from each category response in `data`.
for d in data:
    try:
        items = d['response']['groups'][0]['items']
    except (KeyError, IndexError):
        # Skip responses without the expected structure (e.g. API errors).
        continue
    results.extend(items)

In [193]:
# Number of entries: `results` holds one JSON item per venue.
print(len(results))

1217


### Variables

In [293]:
venue_name = []
venue_id = []
for item in results:
    venue_name.append(item['venue']['name'])
    venue_id.append(item['venue']['id'])
print(len(venue_name))
print(len(venue_id))


rating = []
for item in results:
    try:
        rating.append(item['venue']['rating'])
    except KeyError:
        # Not every venue has a rating.
        rating.append('None')
print(len(rating))


tips = []
# Iterate over `results` like the other lists so the rows stay aligned.
for item in results:
    try:
        tips.append(item['tips'][0]['text'])
    except (KeyError, IndexError):
        # Not every venue has a tip.
        tips.append('None')
print(len(tips))


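cat_id = []
cat_name = []
for item in results:
    # Take the first listed category so each venue adds exactly one entry
    # and the lists stay the same length as `results`.
    try:
        first_category = item['venue']['categories'][0]
        cat_name.append(first_category['name'])
        cat_id.append(first_category['id'])
    except (KeyError, IndexError):
        cat_name.append('None')
        cat_id.append('None')
print(len(cat_name))
print(len(cat_id))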


ll = []
for item in results:
    try:
        location = item['venue']['location']
        ll.append(str(location['lat']) + ',' + str(location['lng']))
    except KeyError:
        ll.append('None')
print(len(ll))




1217
1217
1217
1217
1217
1217
1217
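
The parallel lists built above stay aligned only as long as every loop appends exactly one entry per venue. An alternative sketch is to build one record per venue, so a missing field can never shift a column. The `venue_record` helper and the sample items below are hypothetical; the field names mirror the keys accessed in the cell above.

```python
def venue_record(item):
    """Flatten one explore item into a flat dict; missing fields become None."""
    venue = item.get("venue", {})
    location = venue.get("location", {})
    categories = venue.get("categories") or [{}]
    tips = item.get("tips") or [{}]
    lat, lng = location.get("lat"), location.get("lng")
    has_coords = lat is not None and lng is not None
    return {
        "venue_name": venue.get("name"),
        "venue_id": venue.get("id"),
        "rating": venue.get("rating"),
        "tips": tips[0].get("text"),
        "cat_name": categories[0].get("name"),
        "cat_id": categories[0].get("id"),
        "ll": "{},{}".format(lat, lng) if has_coords else None,
    }

# Made-up items shaped like the entries in `results`.
sample_items = [
    {"venue": {"name": "Example Aquarium", "id": "abc123", "rating": 7.5,
               "location": {"lat": 51.5, "lng": -0.12},
               "categories": [{"name": "Aquarium", "id": "cat1"}]},
     "tips": [{"text": "Worth a visit"}]},
    {"venue": {"name": "Example Arcade", "id": "def456",
               "location": {"lat": 51.6, "lng": 0.18},
               "categories": []}},
]
records = [venue_record(item) for item in sample_items]
for record in records:
    print(record)
```

A list of such records can be passed straight to `pd.DataFrame(records, columns=cols)`, and using `None` rather than the string `'None'` lets pandas treat the gaps as missing values.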


In [295]:
cols = ['venue_name', 'venue_id','rating','tips', 'cat_name','cat_id','ll']
venue_dict = {'venue_name': venue_name, 'venue_id':venue_id,'rating':rating, 'tips':tips, 'cat_name':cat_name,'cat_id':cat_id,'ll':ll}

venues = pd.DataFrame(venue_dict, columns = cols)
venues

| | venue_name | venue_id | rating | tips | cat_name | cat_id | ll |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Rish Mix | 56b39c20498ea8fecd563d3d | | none | Amphitheater | 56aa371be4b08b9a8d5734db | 51.523882,-0.068887 |
| 1 | Garden Marquee | 56e3505d498e40af95d08042 | | none | Amphitheater | 56aa371be4b08b9a8d5734db | 51.526064,-0.13527 |
| 2 | Horniman Museum and Gardens | 4ac518d2f964a52045a720e3 | 9.3 | Great fun to be had by everyone. The aquarium ... | Aquarium | 4fceea171983d5d06c3e9823 | 51.4409815123,-0.0613689422607 |
| 3 | Castle Aquatics | 4e3d4fe4483b04e17a93bad1 | | Love this place my new local shop | Aquarium | 4fceea171983d5d06c3e9823 | 51.4669013,0.0528256 |
| 4 | Sea Life London Aquarium | 4bdd45a2645e0f47f3346b19 | 7.5 | Enter our prize draw to win a family ticket to... | Aquarium | 4fceea171983d5d06c3e9823 | 51.501711493,-0.119767368051 |
| 5 | Four Quarters East | 589cc36704f4d750f0c06ea0 | 8.4 | Great place for beer and video games, both arc... | Arcade | 4bf58dd8d48988d1e1931735 | 51.5469033096,-0.0243907579104 |
| 6 | Namco Funscape | 4d31c11ceefa8cfa20d22eb3 | 6.2 | Free foosball in the bar! | Arcade | 4bf58dd8d48988d1e1931735 | 51.575568781,0.179667461709 |
| 7 | MFA Bowl | 4be586c22457a5935d8fab15 | 7.2 | Including STATE OF THE ART arcade | Bowling Alley | 4bf58dd8d48988d1e4931735 | 51.4627149497,-0.0085972969013 |
| 8 | NOW Gallery | 5454e529498ef0cd9c836e92 | 8.6 | Cool gallery, worth a visit | Art Gallery | 4bf58dd8d48988d1e2931735 | 51.500323233,0.00465396977695 |
| 9 | Painted Hall | 4bbdb641a8cf76b0bf0bb2fd | 8.6 | Described as ‘probably the finest dining hall ... | Art Gallery | 4bf58dd8d48988d1e2931735 | 51.4829669049,-0.00609366277314 |
