# Swiss geo tweet - Affluence map and mobility patterns in Switzerland

## Abstract

The aim of this project is to:
- build a dataset of Switzerland from several social network sources (geo-located tweets and, possibly, Instagram posts) and run a flow analysis on it, producing a dataset that can be used for visualization (e.g. a timeline map);
- infer mobility patterns from it and detect both regular events (e.g. people living in the canton of Vaud but working in the canton of Geneva) and major events (e.g. Paleo Festival, Geneva Motor Show).

With the visualization, people will be able to get an overview of affluence (how crowded places are) in time (months, days, hours) and space (main axes, cantons, cities, places). From there, they can filter the map and display the locations they want to visit according to whether they are crowded at that time of the year or day.

## Data Description

Our main source of data will be Twitter (and possibly Instagram posts).

There is already a dataset of tweets posted in Switzerland from 2012. The [Twitter API overview](https://dev.twitter.com/overview/api) describes which fields can be fetched from tweets.

We do not yet have a dataset of Instagram posts in Switzerland, and part of our project is to see whether we can get one. The [Instagram API endpoints](https://www.instagram.com/developer/endpoints/) documentation describes what can be fetched from Instagram posts (such as the [location](https://www.instagram.com/developer/endpoints/locations/)).

## Feasibility and Risks

This project involves some challenging tasks. First of all, we need to get the corresponding data. As we don't have any for Instagram posts in Switzerland, we will need to find a way to collect it over several months or years, if possible. We can mine Instagram using [python-instagram](https://github.com/facebookarchive/python-instagram).

For tweets, a dataset has already been collected. We therefore have to extract the relevant information, which is mainly the events' hashtags, the locations and possibly the users' ids. One difficulty is that not all tweets were produced by a device with geo-location enabled, which may reduce the size of our usable data.
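
As an illustration of this extraction step, here is a minimal sketch using only the standard library. The function names, the accent-stripping normalization and the sample rows are assumptions made for the example, not part of the collected dataset; only the column names (`text`, `longitude`, `latitude`) come from the tweet schema.

```python
import re
import unicodedata

# Matches '#' followed by word characters (Unicode letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags of a tweet, lower-cased and with accents removed."""
    tags = []
    for raw in HASHTAG_RE.findall(text):
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", raw)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        tags.append(stripped.lower())
    return tags


def has_coordinates(tweet):
    """True when the tweet carries device-level longitude and latitude."""
    return tweet.get("longitude") is not None and tweet.get("latitude") is not None


# Hypothetical rows, for illustration only.
sample = [
    {"userId": 1, "text": "Ready for #Paléo #paleo2015!", "longitude": 6.21, "latitude": 46.40},
    {"userId": 2, "text": "No location here #Geneva", "longitude": None, "latitude": None},
]

geo_tweets = [t for t in sample if has_coordinates(t)]
hashtags = [extract_hashtags(t["text"]) for t in geo_tweets]
```

Lower-casing and accent removal only merge the simplest spelling variants; grouping variants such as `paleo` and `paleo2015` under the same event would need an additional step.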

In addition, once these features have been extracted from the tweets, one difficulty will be to infer the type of each user location (workplace, home or in-between point). Then, we will need to identify events' hashtags. As they are not really structured (misspellings, many variants for the same event, etc.), it may be difficult to infer the correct context or information from them. It will also be interesting to analyze the text of the selected tweets in order to get an idea of an event's characteristics, for instance.
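
One possible heuristic for the home/workplace question is sketched below: for each user, take the most frequent location during assumed working hours and during assumed night hours. The hour ranges, the rounding precision, the function names and the sample rows are all assumptions chosen for illustration; the project has not settled on a method.

```python
from collections import Counter, defaultdict
from datetime import datetime


def time_bucket(ts):
    """Classify a timestamp as 'work', 'home' or 'other' (assumed hour ranges)."""
    if ts.weekday() < 5 and 9 <= ts.hour < 17:
        return "work"  # Monday to Friday, 09:00-16:59
    if ts.hour >= 20 or ts.hour < 6:
        return "home"  # 20:00-05:59, any day
    return "other"


def infer_locations(tweets, precision=2):
    """Map each userId to its most frequent rounded (lat, lon) per time bucket."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for t in tweets:
        ts = datetime.strptime(t["createdAt"], "%Y-%m-%d %H:%M:%S")
        cell = (round(t["latitude"], precision), round(t["longitude"], precision))
        counts[t["userId"]][time_bucket(ts)][cell] += 1
    return {
        user: {bucket: cells.most_common(1)[0][0] for bucket, cells in buckets.items()}
        for user, buckets in counts.items()
    }


# Hypothetical tweets of one user: daytime near Geneva, evenings near Lausanne.
sample = [
    {"userId": 1, "createdAt": "2015-03-02 10:00:00", "latitude": 46.2044, "longitude": 6.1432},
    {"userId": 1, "createdAt": "2015-03-03 11:00:00", "latitude": 46.2031, "longitude": 6.1449},
    {"userId": 1, "createdAt": "2015-03-02 22:00:00", "latitude": 46.5197, "longitude": 6.6323},
    {"userId": 1, "createdAt": "2015-03-03 23:30:00", "latitude": 46.5211, "longitude": 6.6330},
]

locations = infer_locations(sample)
```

Rounding coordinates to two decimals groups points that are roughly within a kilometre of each other, which is a coarse stand-in for proper clustering.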

Once both datasets are collected, we will need to merge them and design a representation that keeps the storage size small and allows us to query the data easily.

One point to note is that the data may not be representative of the whole population we are interested in (Twitter and Instagram tend to be more popular with, and more used by, younger generations).
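
A compact representation could, for example, store counts per hour and per grid cell instead of individual posts. The sketch below assumes a grid obtained by rounding coordinates and a flat list of JSON records; the field names (`hour`, `lat`, `lon`, `count`) and the sample rows are hypothetical.

```python
import json
from collections import Counter


def aggregate(posts, precision=2):
    """Count posts per (date and hour, rounded latitude, rounded longitude)."""
    counts = Counter()
    for p in posts:
        hour = p["createdAt"][:13]  # keeps 'YYYY-MM-DD HH'
        key = (hour, round(p["latitude"], precision), round(p["longitude"], precision))
        counts[key] += 1
    return [
        {"hour": h, "lat": lat, "lon": lon, "count": n}
        for (h, lat, lon), n in sorted(counts.items())
    ]


# Hypothetical posts, for illustration only.
sample = [
    {"createdAt": "2015-07-20 18:05:00", "latitude": 46.4012, "longitude": 6.2134},
    {"createdAt": "2015-07-20 18:40:00", "latitude": 46.3988, "longitude": 6.2101},
    {"createdAt": "2015-07-20 19:10:00", "latitude": 46.4012, "longitude": 6.2134},
]

records = aggregate(sample)
payload = json.dumps(records)
```

With this layout the storage size grows with the number of occupied cells and hours rather than with the number of posts, and a query by time or place becomes a simple filter on the records.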

## Deliverables

As previously explained, the final goal of this project is to deliver an exploitable dataset (e.g. a JSON file) of population movements in Switzerland and its neighboring areas through time, together with additional information on key population gatherings, such as events, based on tweets and Instagram posts.

## Possible timeplan

The first draft of the timeplan for this project is:
- 1-2 weeks: researching what has already been done on mobility.
- 2 weeks: fetching the data from Instagram for Switzerland for some periods of time (if possible the same period as we have for the tweets: 2012-2016).
- 4 weeks: interpreting the given tweet datasets.
- 1 week: filtering the information needed from these two datasets.
- 2-3 weeks: designing a memory representation that could easily fit our final visualization.


In [1]:
# Required imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import string
import collections
import re
import pycountry
from os import path

# Schema of the data

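| # | Name | Type | Collation | Attributes | Null | Default |
|---|---|---|---|---|---|---|
| 1 | id | bigint(20) | | UNSIGNED | No | None |
| 2 | userId | bigint(20) | | UNSIGNED | No | None |
| 3 | createdAt | timestamp | | | No | 0000-00-00 00:00:00 |
| 4 | text | text | utf8_unicode_ci | | No | None |
| 5 | longitude | float | | | Yes | NULL |
| 6 | latitude | float | | | Yes | NULL |
| 7 | placeId | varchar(25) | utf8_general_ci | | Yes | NULL |
| 8 | inReplyTo | bigint(20) | | UNSIGNED | Yes | NULL |
| 9 | source | int(10) | | UNSIGNED | No | None |
| 10 | truncated | bit(1) | | | No | None |
| 11 | placeLatitude | float | | | Yes | NULL |
| 12 | placeLongitude | float | | | Yes | NULL |
| 13 | sourceName | varchar(255) | utf8_general_ci | | Yes | NULL |
| 14 | sourceUrl | varchar(255) | utf8_general_ci | | Yes | NULL |
| 15 | userName | varchar(200) | utf8_general_ci | | Yes | NULL |
| 16 | screenName | varchar(200) | utf8_general_ci | | Yes | NULL |
| 17 | followersCount | int(10) | | UNSIGNED | Yes | NULL |
| 18 | friendsCount | int(10) | | UNSIGNED | Yes | NULL |
| 19 | statusesCount | int(10) | | UNSIGNED | Yes | NULL |
| 20 | userLocation | varchar(200) | utf8_general_ci | | Yes | NULL |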

In [2]:
col_data = ['id', 'userId', 'createdAt', 'text', 'longitude', 'latitude', 'placeId', 'inReplyTo', 'source', 'truncated', 'placeLatitude', 'placeLongitude', 'sourceName', 'sourceUrl', 'userName', 'screenName', 'followersCount', 'friendsCount', 'statusesCount', 'userLocation']

In [3]:
tweets = pd.read_csv('twitter-swisscom/twex_split_1/twex_1.tsv',names=col_data, sep='\t')
tweets.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,9514097914,17341045.0,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,\N,\N,197,,\N,\N,TwitBird,http://www.nibirutech.com,Tilman Jentzsch,blickwechsel,586,508.0,9016.0,"Bern, Switzerland"
1,9514846412,7198282.0,2010-02-23 06:22:40,Still the best coffee in town — at La Stanza h...,8.53781,47.3678,\N,\N,550,,\N,\N,Gowalla,http://gowalla.com/,Nico Luchsinger,halbluchs,1820,703.0,4687.0,"Zurich, Switzerland"
2,9516574359,14657884.0,2010-02-23 07:34:25,It has been a week or so.. and today I just co...,6.13396,46.1951,\N,\N,3,,\N,\N,foursquare,http://foursquare.com,Javier Belmonte,vichango,167,277.0,2885.0,"Geneva, Switzerland"
3,9516952605,14703863.0,2010-02-23 07:51:47,Getting ready.. http://twitpic.com/14v8gz,8.81749,47.2288,\N,\N,62,,\N,\N,Twittelator,http://stone.com/Twittelator,Urs,ugro,75,161.0,1390.0,"Zürich, Switzerland"
4,9517198943,14393717.0,2010-02-23 08:02:57,Un peu de réconfort liquide en take away après...,6.63254,46.5199,\N,\N,3,,\N,\N,foursquare,http://foursquare.com,Romain P.,PIMboula,135,109.0,2381.0,"Lausanne, Suisse"
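
The preview above shows the literal string `\N` in several nullable columns, which suggests that the null marker of the database export is being read as text. A minimal sketch of one way to handle it follows, assuming `\N` always means "missing"; it uses a small in-memory sample with a reduced, hypothetical set of columns rather than the real file.

```python
import io

import pandas as pd

# Two hypothetical tab-separated rows with a reduced set of columns.
raw = (
    "1\t2010-02-23 05:55:51\t7.43926\t46.9489\t\\N\n"
    "2\t2010-02-23 06:22:40\t\\N\t\\N\t\\N\n"
)
cols = ["id", "createdAt", "longitude", "latitude", "placeId"]

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=cols,
    na_values=["\\N"],          # treat the export's null marker as missing
    parse_dates=["createdAt"],  # convert timestamps to datetime values
)

# Keep only the rows that carry device-level coordinates.
geo = df.dropna(subset=["longitude", "latitude"])
```

The same `na_values` and `parse_dates` arguments could be passed to the `pd.read_csv` call on the real file, so that missing coordinates become `NaN` and can be filtered out directly.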


In [5]:
if False:  # cell disabled
    tweets_full = pd.read_csv('twitter-swisscom/twex.tsv', names=col_data, sep='\t')
    tweets_full.head()

### Cannot read sample.tsv?

In [6]:
if False:  # cell disabled
    tweets = pd.read_csv('twitter-swisscom/sample.tsv', names=col_data, sep='\t')
    tweets.head()
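
The reason `sample.tsv` cannot be read is not known here. One common cause with tweet text is an unbalanced double quote, which the parser treats as the start of a quoted field. The sketch below shows, on a hypothetical in-memory sample, how disabling quote handling lets such a line be read; whether this applies to `sample.tsv` is an assumption that would need to be checked against the actual error message.

```python
import csv
import io

import pandas as pd

# Hypothetical rows: the first text field starts with an unbalanced quote.
raw = (
    '1\t"unbalanced quote\t7.4\n'
    "2\tplain text\t8.5\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=["id", "text", "longitude"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
```

With `csv.QUOTE_NONE` the quote character is kept as part of the tweet text instead of opening a quoted field that swallows the following tabs and newlines.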