# Using Twitter Decahose with Cavium

## Table of Contents
- [UM Hadoop Cavium Cluster](#um-hadoop-cavium-cluster)
  - [Setting Python Version](#setting-python-version)
- [PySpark Interactive Shell](#pyspark-interactive-shell)
  - [Exit Interactive Shell](#exit-interactive-shell)
- [Using Jupyter Notebook with PySpark](#using-jupyter-notebook-with-pyspark)
- [Example: Parsing JSON](#example-parsing-json)
  - [Read in twitter file](#read-in-twitter-file)
  - [Selecting Data](#selecting-data)
    - [Getting Nested Data](#getting-nested-data)
    - [Getting Nested Data II](#getting-nested-data-ii)
  - [Summary](#summary)
  - [Saving Data](#saving-data)
  - [Complete Script](#complete-script)
- [Example: Finding text in a Tweet](#example-finding-text-in-a-tweet)
- [Example: Filtering Tweets by Location](#example-filtering-tweets-by-location)
  - [Coordinates](#coordinates)
  - [Place](#place)
  - [Place Types](#place-types)

## UM Hadoop Cavium Cluster <a name='um-hadoop-cavium-cluster'></a>
Twitter data already resides in a directory on Cavium. Log in to Cavium to get started.

SSH to `cavium-thunderx.arc-ts.umich.edu` `Port 22` using an SSH client (e.g. PuTTY on Windows) and log in using your Cavium account and two-factor authentication.

**Note:** ARC-TS has a [Getting Started with Hadoop User Guide](http://arc-ts.umich.edu/new-hadoop-user-guide/)

### Setting Python Version <a name='setting-python-version'></a>
Change the Python version for PySpark to Python 3.X (instead of the default Python 2.7).

```
export PYSPARK_PYTHON=/bin/python3  
export PYSPARK_DRIVER_PYTHON=/bin/python3
```

## PySpark Interactive Shell <a name='pyspark-interactive-shell'></a>
The interactive shell is analogous to a Python console. The following command starts up the interactive shell for PySpark with default settings in the `workshop` queue.  
`pyspark --master yarn --queue workshop`

The following line adds some custom settings.  The 'XXXX' should be a number between 4050 and 4099.  
`pyspark --master yarn --queue workshop --num-executors 500 --executor-memory 5g --conf spark.ui.port=XXXX`

**Note:** You might get a warning message that looks like `WARN Utils: Service 'SparkUI' could not bind on port 40XX. Attempting port 40YY.` This usually resolves itself after a few seconds. If not, try again at a later time.
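
If you launch the shell often, the command with custom settings can be assembled programmatically. The following is a minimal sketch in plain Python 3 (standard library only). The helper name `build_pyspark_command` and its defaults are illustrative assumptions, not part of Spark or the Cavium setup. It only builds the command and checks that the UI port falls in the 4050 to 4099 range; it does not start PySpark.

```python
import random
import shlex


def build_pyspark_command(queue="workshop", num_executors=500,
                          executor_memory="5g", ui_port=None):
    """Return the pyspark launch command as a list of arguments (hypothetical helper)."""
    if ui_port is None:
        # randint is inclusive on both ends
        ui_port = random.randint(4050, 4099)
    if not 4050 <= ui_port <= 4099:
        raise ValueError("spark.ui.port should be between 4050 and 4099")
    return [
        "pyspark", "--master", "yarn", "--queue", queue,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--conf", "spark.ui.port={}".format(ui_port),
    ]


cmd = build_pyspark_command(ui_port=4077)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Calling `build_pyspark_command()` without `ui_port` picks a random port in the allowed range, which should lower the chance of two users asking for the same port.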

The interactive shell does not start with a clean slate. It already has several objects defined for you. 
- `sc` is a SparkContext
- `sqlContext` is a SQLContext object
- `spark` is a SparkSession object

You can check this by typing the variable names.

### Exit Interactive Shell <a name='exit-interactive-shell'></a>
Type `exit()` or press Ctrl-D

## Using Jupyter Notebook with PySpark <a name='using-jupyter-notebook-with-pyspark'></a>
Currently, the Cavium configuration only supports Python 2.7 on Jupyter.

1. Open a command prompt/terminal in Windows/Mac. On Windows, PuTTY should be in your PATH.  Port 8889 is arbitrarily chosen.  
`putty.exe -ssh -L localhost:8889:localhost:8889 cavium-thunderx.arc-ts.umich.edu` (Windows)  
`ssh -L localhost:8889:localhost:8889 cavium-thunderx.arc-ts.umich.edu` (Mac/Linux)
2. This should open an SSH client for Cavium. Log in as usual.
3. From the Cavium terminal, type the following (replace XXXX with a number between 4050 and 4099):

`export PYSPARK_PYTHON=/bin/python3  # not functional code`  
`export PYSPARK_DRIVER_PYTHON=jupyter`  
`export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'`  
`pyspark --master yarn --queue workshop --num-executors 500 --executor-memory 5g --conf spark.ui.port=XXXX`

4. Copy/paste the URL (from your terminal where you launched jupyter notebook) into your browser. The URL should look something like this but with a different token.
http://localhost:8889/?token=745f8234f6d0cf3b362404ba32ec7026cb6e5ea7cc960856
5. You should be connected.

In [1]:
sc

In [2]:
sqlContext

<pyspark.sql.context.SQLContext at 0x40004cfcbed0>

In [3]:
spark

Check Python version

In [32]:
import sys
sys.version

'2.7.5 (default, Oct 31 2018, 18:48:32) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]'

## Example: Parsing JSON <a name='example-parsing-json'></a>
Generic PySpark data wrangling commands can be found at https://github.com/caocscar/workshops/blob/master/pyspark/pyspark.md

### Read in twitter file <a name='read-in-twitter-file'></a>
The Twitter data is stored in JSONLINES format and compressed using bz2. PySpark has a `sqlContext.read.json` function that can handle this for us (including the decompression).

In [3]:
import os
wdir = '/var/twitter/decahose/raw'
df = sqlContext.read.json(os.path.join(wdir,'decahose.2018-03-02.p2.bz2'))

This reads the JSONLINES data into a PySpark DataFrame. We can see the structure of the JSON data using the `printSchema` method.

In [7]:
df.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additional_media_info: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- embeddable: boolean (nullable = true)
 |    |    |    |    |-- monetizable: bo

The schema shows the "root-level" attributes as columns of the DataFrame. Nested data appears as `struct` or `array` columns; when displayed with `show`, a `struct` is printed as its values only (no keys included).
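
To get a feel for the file format itself, here is a small plain-Python 3 sketch (no Spark involved) that writes two made-up records as bz2-compressed JSONLINES and reads them back. The records and the temporary file name are hypothetical; real Decahose tweets carry many more fields.

```python
import bz2
import json
import os
import tempfile

# Two made-up records standing in for tweets (hypothetical data).
records = [
    {"created_at": "Sat Mar 03 04:57:18 +0000 2018", "text": "hello",
     "user": {"name": "Alice", "screen_name": "alice01"}},
    {"created_at": "Sat Mar 03 04:57:19 +0000 2018", "text": "world",
     "user": {"name": "Bob", "screen_name": "bob02"}},
]

path = os.path.join(tempfile.mkdtemp(), "sample.bz2")

# JSONLINES: one JSON object per line; the whole file is then bz2-compressed.
with bz2.open(path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# Reading it back: decompress, then parse each line independently.
with bz2.open(path, "rt", encoding="utf-8") as fh:
    tweets = [json.loads(line) for line in fh if line.strip()]

print(len(tweets))
print(sorted(tweets[0].keys()))           # the "root-level" attributes
print(tweets[0]["user"]["screen_name"])   # nested data keeps its keys in plain Python
```

Because every line is a complete JSON object, a reader can process the file one record at a time.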

**Reference**
 - PySpark JSON Files Guide https://spark.apache.org/docs/latest/sql-data-sources-json.html

 - Twitter Tweet Objects https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html

### Selecting Data <a name='selecting-data'></a>
For example, if we wanted to see what the tweet text is and when it was created, we could do the following.

In [8]:
tweet = df.select('created_at','text')
tweet.printSchema()
tweet.show(5)

root
 |-- created_at: string (nullable = true)
 |-- text: string (nullable = true)

+--------------------+--------------------+
|          created_at|                text|
+--------------------+--------------------+
|Sat Mar 03 04:57:...|RT @nyorai_fgo: ア...|
|Sat Mar 03 04:57:...|絶対サンダイオー出るのやばいんだが...|
|Sat Mar 03 04:57:...|come hang out whi...|
|Sat Mar 03 04:57:...|RT @minstarcholee...|
|Sat Mar 03 04:57:...|RT @prswunews: 🙏...|
+--------------------+--------------------+
only showing top 5 rows



The output is truncated by default. We can override this using the `truncate` argument.

In [9]:
tweet.show(5, truncate=False)

+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|created_at                    |text                                                                                                                                            |
+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Sat Mar 03 04:57:18 +0000 2018|RT @nyorai_fgo: アゲハ蝶で推しカプに涙するオタクは
「貴方に会えたそれだけで良かった世界に光が満ちた」系
「愛されたいと願ってしまった世界が表情を変えた」系
「貴方が望むのならこの身などいつでも差し出していい」系
「ラーラーラーラーーーラーーーーラーーーー（語彙…    |
|Sat Mar 03 04:57:18 +0000 2018|絶対サンダイオー出るのやばいんだがサッヴァーク当たったから良いものの                                                                                                              |
|Sat Mar 03 04:57:18 +0000 2018|come hang out while I strim some #Destiny https://t.co/l2ntwt5GT3             

#### Getting Nested Data <a name='getting-nested-data'></a>
What if we wanted to get at data that was nested, such as the fields inside `user`?

In [11]:
user = df.select('user')
user.printSchema()
user.show(1, truncate=False)

root
 |-- user: struct (nullable = true)
 |    |-- contributors_enabled: boolean (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- default_profile: boolean (nullable = true)
 |    |-- default_profile_image: boolean (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- favourites_count: long (nullable = true)
 |    |-- follow_request_sent: string (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- following: string (nullable = true)
 |    |-- friends_count: long (nullable = true)
 |    |-- geo_enabled: boolean (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- id_str: string (nullable = true)
 |    |-- is_translator: boolean (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- listed_count: long (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- notifications: string (nullable = true)
 |    |-- profile_background_color: str

This returns a single column `user` with the nested data in a list (technically a `struct`).

We can select nested data using the `.` notation.

In [12]:
names = df.select('user.name','user.screen_name')
names.printSchema()
names.show(5)

root
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)

+----------+-----------+
|      name|screen_name|
+----------+-----------+
|        夜雲| ya_kumo229|
|      ぽけねこ|pokeneko867|
|Big Fletch|BigFletchWC|
|    m.byul|    mbyul_m|
|    DEAR12| Dearbabo12|
+----------+-----------+
only showing top 5 rows



To expand ALL the data into individual columns, you can use the `.*` notation.

In [13]:
allcolumns = df.select('user.*')
allcolumns.printSchema()
allcolumns.show(4)

root
 |-- contributors_enabled: boolean (nullable = true)
 |-- created_at: string (nullable = true)
 |-- default_profile: boolean (nullable = true)
 |-- default_profile_image: boolean (nullable = true)
 |-- description: string (nullable = true)
 |-- favourites_count: long (nullable = true)
 |-- follow_request_sent: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- following: string (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- geo_enabled: boolean (nullable = true)
 |-- id: long (nullable = true)
 |-- id_str: string (nullable = true)
 |-- is_translator: boolean (nullable = true)
 |-- lang: string (nullable = true)
 |-- listed_count: long (nullable = true)
 |-- location: string (nullable = true)
 |-- name: string (nullable = true)
 |-- notifications: string (nullable = true)
 |-- profile_background_color: string (nullable = true)
 |-- profile_background_image_url: string (nullable = true)
 |-- profile_background_image_url_https: string (nulla

Some nested data is stored in an `array` instead of `struct`.

In [14]:
arr = df.select('entities.user_mentions.name')
arr.printSchema()
arr.show(5)

root
 |-- name: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------------------+
|                name|
+--------------------+
|  [〆切から逃げるな3/18ア34b]|
|                  []|
|                  []|
|[🌙 #เกียมต้อนรับ...|
|               [มศว]|
+--------------------+
only showing top 5 rows



The data is stored in an `array`, similar to before. We can use the `explode` function to extract the data from an `array`.

In [15]:
from pyspark.sql.functions import explode

arr2 = df.select(explode('entities.user_mentions.name'))
arr2.printSchema()
arr2.show(5)

root
 |-- col: string (nullable = true)

+-------------------+
|                col|
+-------------------+
|   〆切から逃げるな3/18ア34b|
|🌙 #เกียมต้อนรับผัว|
|                มศว|
| Dj Tannie Swiss 🎧|
|  paidamoyo marimbe|
+-------------------+
only showing top 5 rows



If we wanted multiple columns under `user_mentions`, we'd be tempted to use multiple `explode` statements like so.

In [16]:
df.select(explode('entities.user_mentions.name'), explode('entities.user_mentions.screen_name'))

AnalysisException: u'Only one generator allowed per select clause but found 2: explode(entities.user_mentions.name AS `name`), explode(entities.user_mentions.screen_name AS `screen_name`);'

This generates an error: *Only one generator allowed per select clause but found 2:*

We can get around this by using `explode` on the topmost key with an `alias` and then selecting the columns of interest.

In [17]:
mentions = df.select(explode('entities.user_mentions').alias('user_mentions'))
mentions.printSchema()
mentions2 = mentions.select('user_mentions.name','user_mentions.screen_name')
mentions2.show(5)

root
 |-- user_mentions: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- id_str: string (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- name: string (nullable = true)
 |    |-- screen_name: string (nullable = true)

+-------------------+-------------+
|               name|  screen_name|
+-------------------+-------------+
|   〆切から逃げるな3/18ア34b|   nyorai_fgo|
|🌙 #เกียมต้อนรับผัว|minstarcholee|
|                มศว|    prswunews|
| Dj Tannie Swiss 🎧|   tmarimbe23|
|  paidamoyo marimbe|   PaidamoyoM|
+-------------------+-------------+
only showing top 5 rows
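
As a rough mental model, `explode` turns each array element into its own row. The plain-Python sketch below imitates that behavior on a list of dictionaries; the function `explode` defined here is a hypothetical stand-in, not the PySpark function, and the sample tweets are made up.

```python
def explode(rows, key):
    """Yield one output item per element of the list stored under `key`."""
    for row in rows:
        # Rows whose list is empty or missing contribute nothing.
        for element in row.get(key) or []:
            yield element


# Hypothetical tweets with one, zero, and two mentions.
tweets = [
    {"id": 1, "user_mentions": [{"name": "Ann", "screen_name": "ann_a"}]},
    {"id": 2, "user_mentions": []},
    {"id": 3, "user_mentions": [{"name": "Bo", "screen_name": "bo_b"},
                                {"name": "Cy", "screen_name": "cy_c"}]},
]

# Explode the topmost key once, then pick several fields from each element.
mentions = list(explode(tweets, "user_mentions"))
pairs = [(m["name"], m["screen_name"]) for m in mentions]
print(pairs)
```

Exploding the parent array once keeps `name` and `screen_name` paired, which two independent explodes could not guarantee. Tweets without mentions drop out, matching the way the empty `[]` rows disappeared from the exploded output earlier.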



#### Getting Nested Data II <a name='getting-nested-data-ii'></a>
What if we wanted to get at data in a list, such as the `indices` in `user_mentions`?

In [18]:
idx = mentions.select('user_mentions.indices')
idx.printSchema()
idx.show(5)

root
 |-- indices: array (nullable = true)
 |    |-- element: long (containsNull = true)

+--------+
| indices|
+--------+
| [3, 14]|
| [3, 17]|
| [3, 13]|
| [5, 16]|
|[22, 33]|
+--------+
only showing top 5 rows



The schema shows that the data is in an `array` type. Using `explode` here would put each element in its own row, which is not what we want. Instead, we can index the array elements and use the `withColumn` method to store each one in its own column.

In [35]:
idx2 = idx.withColumn('first', idx['indices'][0]).withColumn('second', idx['indices'][1])
idx2.show(5)

+--------+-----+------+
| indices|first|second|
+--------+-----+------+
| [3, 14]|    3|    14|
| [3, 17]|    3|    17|
| [3, 13]|    3|    13|
| [5, 16]|    5|    16|
|[22, 33]|   22|    33|
+--------+-----+------+
only showing top 5 rows



Why the difference?  Because the underlying element is not a `struct` data type but a `long` instead.
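
The same indexing idea in plain Python, as a small sketch: each two-element list is split into `first` and `second`. The helper `split_indices` is hypothetical. The last lines assume, as the Twitter entity documentation describes, that the two values are the start and end offsets of the mention within the tweet text; the text used here is a shortened stand-in.

```python
def split_indices(row):
    """Copy the two array elements into their own fields (hypothetical helper)."""
    out = dict(row)
    out["first"] = row["indices"][0]
    out["second"] = row["indices"][1]
    return out


rows = [{"indices": [3, 14]}, {"indices": [3, 17]}, {"indices": [22, 33]}]
idx2 = [split_indices(r) for r in rows]
print(idx2[0])

# Assumed meaning of the offsets: text[first:second] is the mention itself.
text = "RT @nyorai_fgo: (rest of tweet)"
mention = text[idx2[0]["first"]:idx2[0]["second"]]
print(mention)  # @nyorai_fgo
```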

### Summary <a name='summary'></a>
So if you access JSON data in Python like this:

`(tweet['created_at'], tweet['user']['name'], tweet['user']['screen_name'], tweet['text'])`

The equivalent for a PySpark DataFrame would be like this:
`df.select('created_at','user.name','user.screen_name','text')`
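
To make the correspondence concrete, here is a plain-Python sketch of a `select` helper that resolves dotted paths against a single tweet dictionary. The helper is hypothetical and only imitates the naming behavior seen earlier (the output field is named after the last key in the path); it is not how Spark implements `select`. The sample tweet is made up.

```python
def select(record, *paths):
    """Return a flat dict with one entry per dotted path, named by the last key."""
    row = {}
    for path in paths:
        value = record
        for key in path.split("."):
            # A missing key (or a non-dict parent) yields None, similar to a null.
            value = value.get(key) if isinstance(value, dict) else None
        row[path.split(".")[-1]] = value
    return row


# Made-up tweet for illustration.
tweet = {
    "created_at": "Sat Mar 03 04:57:18 +0000 2018",
    "text": "hello world",
    "user": {"name": "Alice", "screen_name": "alice01"},
}

row = select(tweet, "created_at", "user.name", "user.screen_name", "text")
print(row)
```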

### Saving Data <a name='saving-data'></a>
Once you have constructed your PySpark DataFrame of interest, you should save it (append or overwrite) as a parquet file like so.

In [22]:
folder = 'twitterExtract'
df.write.mode('overwrite').parquet(folder)

### Complete Script <a name='complete-script'></a>
Here is a sample script which combines everything we just covered. It extracts a six-column DataFrame.

In [4]:
import os

wdir = '/var/twitter/decahose/raw'
file = 'decahose.2018-03-02.p2.bz2'
df = sqlContext.read.json(os.path.join(wdir,file))
six = df.select('created_at','user.name','user.screen_name','text','coordinates','place')
folder = 'twitterExtract'
six.write.mode('overwrite').parquet(folder)

## Example: Finding text in a Tweet <a name='example-finding-text-in-a-tweet'></a>
Read in parquet file.

In [1]:
folder = 'twitterDemo'
df = sqlContext.read.parquet(folder)

Below are several ways to match text
***

Exact match `==`

In [25]:
hello = df.filter(df.text == 'hello world')
hello.show(10)

+--------------------+---------------+--------------+-----------+
|          created_at|           name|   screen_name|       text|
+--------------------+---------------+--------------+-----------+
|Wed Jul 03 10:10:...|         shefty|shefty05026540|hello world|
|Tue Jul 02 14:46:...|     Fathur2911|    fathur2911|hello world|
|Fri Jul 05 14:47:...|keru robot mode|          keru|hello world|
|Tue Jul 02 05:42:...|        balmunc|      balmunc1|hello world|
|Wed Jul 03 04:53:...|  fanax20082006| fanax20082006|hello world|
|Mon Jul 01 05:44:...|           Leah|   leahjames24|hello world|
|Wed Jul 03 02:29:...|           Niño|      NyouNyii|hello world|
|Fri Jul 05 01:19:...|        やっぱり甲子園|        hsbbjp|hello world|
|Thu Jul 04 15:51:...|           ささけん|        KRPK_A|hello world|
|Wed Jul 03 02:57:...|     spirit.wan|     spiritwan|hello world|
+--------------------+---------------+--------------+-----------+
only showing top 10 rows



`contains` method

In [26]:
food = df.filter(df['text'].contains(' food'))
food = food.select('text')
food.show(10, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                       |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|RT @JIMINSPROMlSE: ‘focuses on her mouth’ it’s an advertisement for food what else is she supposed to do with it ?? play with it like a bar…               |
|RT @xesixt: bule di comment section street food ini bawel ya, “wear some fucking glove you fuck” bawel lu londo cebok kok pake tisu hhh                    |
|RT @nayelly_nails: Even if I get my food first, I will wait for you to get yours to start eating https://t.co/9zAlhknJXP                                   |
|i thoroughly examine my food alittle more extra now

`startswith` method

In [27]:
once = df.filter(df.text.startswith('Once'))
once = once.select('text')
once.show(10, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Once zaleel always zaleel 
#fakharZaleel                                                                                                    |
|Once again my illiterate friends we dont take our husbands’ names here in Egypt, there’s no Mr and Mrs X, are you t… https://t.co/cYQ1yEJ6T6|
|Once you get to know about the particular hazards that occur at your workplace, then it will help you in reducing t… https://t.co/STDgBsX9Nu|
|Once you find ways to happily achieve what you want with what you have, then when you have abundance you know you're flying in colors. 😁   |


`endswith` method

In [28]:
ming = df.filter(df['text'].endswith('ming'))
ming = ming.select('text')
ming.show(10, truncate=False)

+----------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------+
|#StrangerThings3Premiere was underwhelming                                                                            |
|@GouldyGaming Love to see you videos 😊@GouldyGaming                                                                  |
|RT @vanjess: Divine timing                                                                                            |
|The amount of layers people in Paris wear in 90-degree weather is alarming                                            |
|https://t.co/t56KLqJxXo

I don’t know if you saw this, but yes, this, yes !!!!!
@idreamofcumming                      |
|I’m not 5sos but when is a new a

`like` method using SQL wildcards

In [29]:
mom = df.filter(df.text.like('%mom_'))
mom = mom.select('text')
mom.show(10, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|【定期】ブログやってますのでよかったら来てやって下さい(*´∀｀)行ったイベントのレポ/ITなお話/気になる小ネタ/思うことなどなど頑張って書きます(*´ω｀*)b https://t.co/CZsRQcdsqf #mosmome                         |
|@iazs97 سناباتك تفتح النفس https://t.co/0QpEdLmomV                                                                                          |
|@ghsolowonda @johndumelo1 No please I'm on momo                                                                                             |
|I'm scared that the tickets would go on sale while I'll be at the camp.. also I have to ask my mom😭                                        |
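
In SQL `LIKE` patterns, `%` matches any sequence of characters (including none) and `_` matches exactly one character, and the pattern must cover the whole string. The sketch below translates a pattern into a Python 3 regular expression to show why `'%mom_'` matches the rows above. The helper names are hypothetical, and escape characters in the pattern are deliberately ignored to keep the example short.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")          # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")           # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def sql_like(text, pattern):
    """True when the whole text matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(text) is not None


print(sql_like("No please I'm on momo", "%mom_"))   # "mom" plus one character
print(sql_like("I have to ask my mom", "%mom_"))    # nothing after "mom"
print(sql_like("I have to ask my mom!!", "%mom_"))  # two characters after "mom"
```

So `'%mom_'` reads as "ends with `mom` followed by exactly one more character", which is why texts ending in `momo`, `mome`, or `mom` plus an emoji qualify.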


regular expressions ([workshop material](https://github.com/caocscar/workshops/tree/master/regex))

In [34]:
regex = df.filter(df['text'].rlike('[ia ]king'))
regex = regex.select('text')
regex.show(10, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|At least there wre some trees around. Love the light making circular patterns on branches in the winter. https://t.co/8niWN7vevH            |
|#HanggangMayMayWard  lalaban ako  at susuportahan ko sila sa abot ng aking makakaya                                                         |
|RT @VisitGraceland: Thank you, thank you very much to @ArgoMemphis for the incredible #Elvis-themed fireworks. The king would’ve loved this…|
|RT @Kevinfischer593: Here's a video of me taking ice cream and putting it in my cart and not licking it https://t.co/t5sq9al0DN             |
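
`rlike` keeps a row when the pattern is found anywhere in the text. Spark evaluates the pattern with Java's regex engine, but for a simple character class like this one Python's `re.search` behaves the same way, so the pattern can be tried locally first. The sample strings below are shortened or made up.

```python
import re

# "king" preceded by an "i", an "a", or a space
pattern = re.compile(r"[ia ]king")

samples = [
    "Love the light making circular patterns",  # "aking"
    "The king would have loved this",           # " king"
    "Viking ships",                             # "iking"
    "not licking it",                           # "cking" only, no match
    "King of the hill",                         # capital K, no match
]

matches = [s for s in samples if pattern.search(s)]
for s in matches:
    print(s)
```

The match is case-sensitive and does not need to span the whole string, unlike `like`.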

Applying more than one condition. When building DataFrame boolean expressions, use
- `&` for `and`
- `|` for `or`
- `~` for `not`  

In [31]:
resta = df.filter(df.text.contains('resta') & df.text.endswith('ing'))
resta = resta.select('text')
resta.show(10, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Le Relais, un restaurant fertile - L'Hôtellerie Restauration https://t.co/suSRDWyBeU #paris #coworking                                      |
|RT @Amrut58731711: @Indiamining @PMOIndia @nstomar @nitin_gadkari @goacm @makeinindia #GoaMining #restartGoamining                          |
|@sam_carrington1 A birthday meal at a seafood restaurant I had a bad reaction to mussels, spent the night vomiting                          |
|RT @SHRIKRISHNA8484: @Indiamining Restart Goa mining operation
#200days  #restartgoamining  #goamining                                      |
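
Python does not allow `and`, `or`, and `not` to be overloaded, which is why column expressions rely on `&`, `|`, and `~` instead. The toy class below (a hypothetical `Cond`, not PySpark's `Column`) sketches the idea in plain Python: each operator builds a new condition that is evaluated later, row by row. The sample texts are shortened or made up.

```python
class Cond:
    """A toy stand-in for a boolean column expression."""

    def __init__(self, func):
        self.func = func

    def __and__(self, other):            # cond1 & cond2
        return Cond(lambda text: self.func(text) and other.func(text))

    def __or__(self, other):             # cond1 | cond2
        return Cond(lambda text: self.func(text) or other.func(text))

    def __invert__(self):                # ~cond
        return Cond(lambda text: not self.func(text))

    def __call__(self, text):
        return self.func(text)


def contains(sub):
    return Cond(lambda text: sub in text)


def endswith(suffix):
    return Cond(lambda text: text.endswith(suffix))


texts = [
    "A birthday meal at a seafood restaurant, spent the night vomiting",
    "#GoaMining #restartGoamining",
    "restaurant review",
    "good night",
]

both = contains("resta") & endswith("ing")
either = contains("resta") | endswith("ing")
neither_resta = ~contains("resta")

kept = [t for t in texts if both(t)]
print(kept)
```

One practical caveat when combining comparisons in PySpark: `&` and `|` bind more tightly than `==` in Python, so each comparison needs its own parentheses, e.g. `(df.text == 'a') | (df.text == 'b')`.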

**Reference**: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

## Example: Filtering Tweets by Location <a name="example-filtering-tweets-by-location"></a>

Read in parquet file.

In [67]:
folder = 'twitterDemo'
df = sqlContext.read.parquet(folder)

From the [Twitter Geo-Objects documentation](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects):

> There are two "root-level" JSON objects used to  describe the location associated with a Tweet: `coordinates` and `place`.

> The `place` object is always present when a Tweet is geo-tagged, while the `coordinates` object is only present (non-null) when the Tweet is assigned an exact location. If an exact location is provided, the `coordinates` object will provide a [long, lat] array with the geographical coordinates, and a Twitter Place that corresponds to that location will be assigned.

### Coordinates <a name="coordinates"></a>
Select Tweets that have GPS coordinates

In [68]:
coords = df.filter(df['coordinates'].isNotNull())

Construct a longitude and latitude column

In [69]:
coords = coords.withColumn('lng', coords['coordinates.coordinates'][0])
coords = coords.withColumn('lat', coords['coordinates.coordinates'][1])
coords.printSchema()
coords.show(5, truncate=False)

root
 |-- created_at: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- text: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- place: struct (nullable = true)
 |    |-- bounding_box: struct (nullable = true)
 |    |    |-- coordinates: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- country_code: string (nullable = true)
 |    |-- full_name: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- place_type: string (nullable = true)
 |    |-- url: st

Apply a bounding box to tweets and count the number of matching tweets

In [70]:
A2 = coords.filter(coords['lng'].between(-84,-83) & coords['lat'].between(42,43))
A2.show(5, truncate=False)
A2.count()

+------------------------------+------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+
|created_at                    |name              |screen_name   |text                                                                                                                                           |coordinates                                  |place                                                                                                                                                

652
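
The bounding-box logic can be checked on a few points in plain Python. This sketch assumes the `[long, lat]` ordering quoted from the Twitter documentation and mirrors the inclusive bounds of `between`. The helper names and the sample points (approximate city coordinates plus one corner case) are illustrative.

```python
def between(value, lower, upper):
    """Inclusive range check, like Column.between."""
    return lower <= value <= upper


def in_bbox(point, lng_range, lat_range):
    lng, lat = point  # order is [long, lat], not [lat, long]
    return between(lng, *lng_range) and between(lat, *lat_range)


# Approximate coordinates, for illustration only.
points = [
    ("Ann Arbor", [-83.74, 42.28]),
    ("Detroit", [-83.05, 42.33]),
    ("Chicago", [-87.63, 41.88]),
    ("corner", [-84.0, 43.0]),   # sits exactly on the box boundary
]

inside = [name for name, pt in points if in_bbox(pt, (-84, -83), (42, 43))]
print(inside)
```

Swapping longitude and latitude is an easy mistake to make; with the values reversed, none of these points would fall inside the box.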

### Place <a name="place"></a>
Search for places by name 

Create separate columns from `place` object

In [71]:
place = df.filter(df['place'].isNotNull())
place = place.select('place.country', 'place.country_code', 'place.place_type','place.name', 'place.full_name')
place.printSchema()
place.show(10, truncate=False)

root
 |-- country: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- place_type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_name: string (nullable = true)

+----------------------+------------+----------+----------------------+-----------------------+
|country               |country_code|place_type|name                  |full_name              |
+----------------------+------------+----------+----------------------+-----------------------+
|Portugal              |PT          |city      |Barcelos              |Barcelos, Portugal     |
|Brasil                |BR          |city      |São Luís              |São Luís, Brasil       |
|Malaysia              |MY          |city      |Petaling Jaya         |Petaling Jaya, Selangor|
|Uzbekistan            |UZ          |country   |Uzbekistan            |Uzbekistan             |
|Germany               |DE          |city      |Illmensee             |Illmensee, Deutschland |
|Ireland               |

Apply place filter

In [72]:
MI = place.filter(place['full_name'].contains(' MI'))
MI.show(10, truncate=False)

+-------------+------------+----------+-------------+-----------------+
|country      |country_code|place_type|name         |full_name        |
+-------------+------------+----------+-------------+-----------------+
|United States|US          |city      |Grand Rapids |Grand Rapids, MI |
|United States|US          |city      |Roseville    |Roseville, MI    |
|United States|US          |city      |Detroit      |Detroit, MI      |
|United States|US          |city      |Holt         |Holt, MI         |
|United States|US          |city      |Clinton      |Clinton, MI      |
|United States|US          |city      |Clinton      |Clinton, MI      |
|United States|US          |city      |Roseville    |Roseville, MI    |
|United States|US          |city      |Harrison     |Harrison, MI     |
|United States|US          |city      |Detroit Beach|Detroit Beach, MI|
|United States|US          |city      |Ferndale     |Ferndale, MI     |
+-------------+------------+----------+-------------+-----------

**Tip**: Refer to section ["Finding text in a Tweet"](#example-finding-text-in-a-tweet) for other search methods
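
A substring test such as `contains(' MI')` is convenient but fairly loose. The plain-Python sketch below compares it with a stricter suffix test on a handful of `full_name` values. The first three appear in the outputs of this tutorial; the last one is a made-up name chosen to trigger a false positive.

```python
full_names = [
    "Detroit, MI",
    "Holt, MI",
    "Michigan, USA",        # admin-level place, no ", MI" suffix
    "Casa MI Tierra, TX",   # hypothetical name containing " MI"
]

loose = [n for n in full_names if " MI" in n]           # like contains(' MI')
strict = [n for n in full_names if n.endswith(", MI")]  # like endswith(', MI')

print(loose)
print(strict)
```

Neither test catches the admin-level entry `Michigan, USA`, so a filter on state may need a second condition for that `place_type`.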

### Place Types <a name="place-types"></a>
There are five kinds of `place_type` in the Twitter dataset, listed here in approximately descending order of geographic area:
1. country
2. admin
3. city
4. neighborhood
5. poi (point of interest)

Here's a breakdown of the relative frequency for this dataset

In [103]:
place.createOrReplaceTempView('Places')  # replaces registerTempTable, deprecated since Spark 2.0
place_type_ct = sqlContext.sql('SELECT place_type, COUNT(*) as ct FROM Places GROUP BY place_type ORDER BY ct DESC')
place_type_ct.show()

+------------+-------+
|  place_type|     ct|
+------------+-------+
|        city|1738893|
|       admin| 221170|
|     country|  79811|
|         poi|  24701|
|neighborhood|   3343|
+------------+-------+
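
The table reports raw counts. To express them as relative frequencies, divide each count by the total. A short plain-Python sketch using the counts printed above (the variable names are illustrative):

```python
# Counts copied from the query output above.
counts = {
    "city": 1738893,
    "admin": 221170,
    "country": 79811,
    "poi": 24701,
    "neighborhood": 3343,
}

total = sum(counts.values())
shares = {place_type: ct / total for place_type, ct in counts.items()}

# Print place types from most to least frequent, with their share of the total.
for place_type, ct in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<12} {:>8} {:7.2%}".format(place_type, ct, shares[place_type]))
```

City-level places dominate, accounting for roughly five out of every six geo-tagged tweets in this sample.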



Here are some examples of each `place_type`:

#### Country

In [104]:
country = sqlContext.sql("SELECT * FROM Places WHERE place_type = 'country'")
country.show(5, truncate=False)

+-----------------------+------------+----------+-----------------------+-----------------------+
|country                |country_code|place_type|name                   |full_name              |
+-----------------------+------------+----------+-----------------------+-----------------------+
|Uzbekistan             |UZ          |country   |Uzbekistan             |Uzbekistan             |
|Bosnia and Herzegovina |BA          |country   |Bosnia and Herzegovina |Bosnia and Herzegovina |
|United States          |US          |country   |United States          |United States          |
|Ukraine                |UA          |country   |Ukraine                |Ukraine                |
|República de Moçambique|MZ          |country   |República de Moçambique|República de Moçambique|
+-----------------------+------------+----------+-----------------------+-----------------------+
only showing top 5 rows



#### Admin (US examples)

In [109]:
admin = sqlContext.sql("SELECT * FROM Places WHERE place_type = 'admin' AND country_code = 'US'")
admin.show(10, truncate=False)

+-------------+------------+----------+--------------+-------------------+
|country      |country_code|place_type|name          |full_name          |
+-------------+------------+----------+--------------+-------------------+
|United States|US          |admin     |Louisiana     |Louisiana, USA     |
|United States|US          |admin     |New York      |New York, USA      |
|United States|US          |admin     |California    |California, USA    |
|United States|US          |admin     |Michigan      |Michigan, USA      |
|United States|US          |admin     |South Carolina|South Carolina, USA|
|United States|US          |admin     |Virginia      |Virginia, USA      |
|United States|US          |admin     |South Dakota  |South Dakota, USA  |
|United States|US          |admin     |Louisiana     |Louisiana, USA     |
|United States|US          |admin     |Florida       |Florida, USA       |
|United States|US          |admin     |Indiana       |Indiana, USA       |
+-------------+------------+----------+--------------+-------------------+

#### City

In [106]:
city = sqlContext.sql("SELECT * FROM Places WHERE place_type = 'city'")
city.show(5, truncate=False)

+--------+------------+----------+-------------+-----------------------+
|country |country_code|place_type|name         |full_name              |
+--------+------------+----------+-------------+-----------------------+
|Portugal|PT          |city      |Barcelos     |Barcelos, Portugal     |
|Brasil  |BR          |city      |São Luís     |São Luís, Brasil       |
|Malaysia|MY          |city      |Petaling Jaya|Petaling Jaya, Selangor|
|Germany |DE          |city      |Illmensee    |Illmensee, Deutschland |
|Ireland |IE          |city      |Kildare      |Kildare, Ireland       |
+--------+------------+----------+-------------+-----------------------+
only showing top 5 rows



#### Neighborhood (US examples)

In [107]:
neighborhood = sqlContext.sql("SELECT * FROM Places WHERE place_type = 'neighborhood' AND country_code = 'US'")
neighborhood.show(10, truncate=False)

+-------------+------------+------------+-------------------+------------------------------+
|country      |country_code|place_type  |name               |full_name                     |
+-------------+------------+------------+-------------------+------------------------------+
|United States|US          |neighborhood|Duboce Triangle    |Duboce Triangle, San Francisco|
|United States|US          |neighborhood|Downtown           |Downtown, Houston             |
|United States|US          |neighborhood|South Los Angeles  |South Los Angeles, Los Angeles|
|United States|US          |neighborhood|Cabbagetown        |Cabbagetown, Atlanta          |
|United States|US          |neighborhood|Downtown           |Downtown, Memphis             |
|United States|US          |neighborhood|Downtown           |Downtown, Houston             |
|United States|US          |neighborhood|Hollywood          |Hollywood, Los Angeles        |
|United States|US          |neighborhood|Clinton            |Clinton, 

#### POI (US examples)

In [108]:
poi = sqlContext.sql("SELECT * FROM Places WHERE place_type = 'poi' AND country_code = 'US'")
poi.show(10, truncate=False)

+-------------+------------+----------+---------------------------------------------+---------------------------------------------+
|country      |country_code|place_type|name                                         |full_name                                    |
+-------------+------------+----------+---------------------------------------------+---------------------------------------------+
|United States|US          |poi       |Bice Cucina Miami                            |Bice Cucina Miami                            |
|United States|US          |poi       |Ala Moana Beach Park                         |Ala Moana Beach Park                         |
|United States|US          |poi       |Los Angeles Convention Center                |Los Angeles Convention Center                |
|United States|US          |poi       |Cleveland Hopkins International Airport (CLE)|Cleveland Hopkins International Airport (CLE)|
|United States|US          |poi       |Indianapolis Marriott Downtown       