### Data Acquisition

This notebook contains every step used to collect the data from Twitter.<br>
**Data**: Tweets that mention '@BiciMAD' (in the tweet text or the profile name)

## 1. Set up Tweepy

#### 1.1 My App

In [1]:
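import tweepy

# CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_SECRET must be defined
# beforehand (they are not included in this notebook).
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# wait_on_rate_limit=True makes Tweepy wait when the rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True)
api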

<tweepy.api.API at 0x112210d30>
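
The setup cell above relies on four credential constants that are defined elsewhere. As a minimal sketch, assuming the credentials are stored in environment variables, they could be loaded as shown below. The variable names `TWITTER_*` and the helper `load_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
CREDENTIAL_VARS = {
    "CONSUMER_KEY": "TWITTER_CONSUMER_KEY",
    "CONSUMER_SECRET": "TWITTER_CONSUMER_SECRET",
    "ACCESS_TOKEN": "TWITTER_ACCESS_TOKEN",
    "ACCESS_SECRET": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, failing early if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

With this in place, `creds = load_credentials()` would provide the values to pass to `tweepy.OAuthHandler` and `auth.set_access_token`, keeping the secrets out of the notebook itself.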

#### 1.2 App 2

#### 1.3 App 3

## 2. Query data
<hr>

### 2.1 Get Tweets

In [2]:
import datetime

In [3]:
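import pandas as pd
search_term = 'BiciMAD'

# Build the search query: Spanish-language tweets matching search_term,
# posted since today, with the full (untruncated) text.
# Note: in Tweepy v4 and later, api.search is named api.search_tweets.
# Alternative query on a single user's timeline:
# tweets = tweepy.Cursor(api.user_timeline,id=username).items(count)
tweets = tweepy.Cursor(api.search,
                   q=search_term,
                   lang="es", tweet_mode="extended",
                   since=datetime.datetime.today().strftime('%Y-%m-%d')).items()

# One row per tweet: creation date, tweet id, full text, and user details
tweets_list = [[tweet.created_at, tweet.id, tweet.full_text, tweet.user.name, tweet.user.id, tweet.user.screen_name] for tweet in tweets]

# Create a dataframe from the tweets list
# Add or remove columns here as you add or remove tweet fields above
tweets_df = pd.DataFrame(tweets_list)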

In [4]:
tweets_df

Unnamed: 0,0,1,2,3,4,5
0,2020-10-23 16:28:20,1319677044989493248,@mpjc2 @begonavillacis @BiciMAD @emt @EliteTax...,Jesús García Diaz,955910563820986368,JessGarcaDiaz3
1,2020-10-23 16:27:02,1319676716349050880,­­📰 El distrito de Latina tiene ya servicio de...,Madrid Actual,50969179,madridactual
2,2020-10-23 16:19:55,1319674925884592130,RT @diego_rebollo: @begonavillacis @BiciMAD As...,alpedhuez,823491982219689984,alpedhu37343368
3,2020-10-23 16:19:45,1319674884700672000,RT @CANAL33: El alcalde de Madrid @AlmeidaPP_ ...,Juanchiiii 😅🚇🖥️😍 👬,811543987375013888,deleganada
4,2020-10-23 16:14:38,1319673595929464832,@begonavillacis @Bogeymaann @BiciMAD Que no se...,paco anaya,215467698,acinrag_
...,...,...,...,...,...,...
432,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio,126324648,dani_apa69
433,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober,2870657137,zalorpass
434,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike,412800059,KikeM_
435,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD,978551226,antonio44860896


In [5]:
df_tweets = tweets_df.copy()

### 2.2 Clean data
<hr>

- **Select columns** 
<br> 
    Rename the selected columns
<br>
- **Drop duplicates** 
<br>
    Drop duplicate tweets based on the 'id' column
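
The two cleaning steps can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's exact code: the helper `clean_and_merge` and the sample rows are hypothetical, and `pd.concat` is used in place of the outer `pd.merge` from the cells below so that `keep='last'` deterministically prefers the newly downloaded row.

```python
import pandas as pd

COLUMNS = ['date', 'id', 'text', 'user_name', 'user_id', 'user_screen_name']


def clean_and_merge(df_new, df_old):
    """Name the columns of df_new, stack it under df_old, and drop duplicate ids."""
    df_new = df_new.copy()
    df_new.columns = COLUMNS
    # Cast everything to str so both frames share the same dtypes
    combined = pd.concat([df_old.astype(str), df_new.astype(str)], ignore_index=True)
    # keep='last' retains the most recently downloaded copy of each tweet
    combined = combined.drop_duplicates(subset=['id'], keep='last')
    return combined.reset_index(drop=True)


# Toy data (made up for illustration): tweet 2 appears in both frames
old = pd.DataFrame(
    [['2020-10-22 10:00:00', 1, 'first', 'Ana', 10, 'ana'],
     ['2020-10-22 11:00:00', 2, 'second (old copy)', 'Luis', 20, 'luis']],
    columns=COLUMNS,
)
new = pd.DataFrame(
    [['2020-10-22 11:00:00', 2, 'second (new copy)', 'Luis', 20, 'luis'],
     ['2020-10-23 09:00:00', 3, 'third', 'Eva', 30, 'eva']]
)

result = clean_and_merge(new, old)
```

Here `result` holds one row per tweet id, and the row for tweet 2 comes from `new`.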

In [8]:
# Rename columns (direct assignment also works on pandas versions where set_axis has no inplace argument)
df_tweets.columns = ['date','id', 'text', 'user_name','user_id', 'user_screen_name']

In [9]:
# Drop duplicates before saving
# Read already existing data
df_old = pd.read_csv('../data/raw/rawdata.csv')
df_old = df_old.drop(columns =['Unnamed: 0'])
df_old = df_old.astype(str)

In [10]:
df_str = df_tweets.astype(str)

In [11]:
df = pd.merge(df_old, df_str, how ='outer')

In [12]:
# Remove stray header rows, i.e. rows whose 'date' value is the literal string 'date'
df = df[df.date != 'date']

In [13]:
df.drop_duplicates(subset=['id'],keep='last', inplace= True)

In [14]:
df.reset_index()

Unnamed: 0,index,date,id,text,user_name,user_id,user_screen_name
0,0,2020-10-10 11:07:26,1314885243531399169,@PlataformaEMT @FRAVM @BiciMAD @AlmeidaPP_ @bc...,VICENTE RODRIGUEZ MU,,
1,1,2020-10-10 10:38:11,1314877884067188736,@PlataformaEMT @MADRID @AlmeidaPP_ @bcarabante...,Stielike,,
2,2,2020-10-10 10:35:49,1314877290489184261,@mcascallares Ha podido tratarse de un fallo p...,BiciMAD,,
3,3,2020-10-10 10:35:14,1314877141658542083,"@AnxoOroisPhoto Te pedimos disculpas, ha podid...",BiciMAD,,
4,4,2020-10-10 10:13:18,1314871622487212033,"Hola @BiciMAD la app no funciona, llevo 1 hora...",AnxoOroisPhotography,,
...,...,...,...,...,...,...,...
9151,9151,2020-10-23 15:10:21,1319657417941057537,@begonavillacis @BiciMAD Cuantas inaguraciones...,Cristal45,1263212011028131840,Cristal454
9152,9152,2020-10-23 15:09:24,1319657181889912832,RT @AMovilidadMad: ¿Te mueves en bicicleta 🚲 p...,NANO,821767980,nanoveros
9153,9153,2020-10-23 15:06:11,1319656370736668673,"RT @CsMadridCiudad: 🚲 Desde @MADRID, seguimos ...",Ciudadanos Utrera,4202843157,Cs_Utrera
9154,9154,2020-10-23 15:03:03,1319655584053026816,El distrito de Latina tiene ya servicio de Bic...,Antonio Arroyo Quero,275204843,Antonio__AQ


In [15]:
# Check that the new tweets are in df
df.sort_values('date', ascending = False).head(10)

Unnamed: 0,date,id,text,user_name,user_id,user_screen_name
9109,2020-10-23 16:28:20,1319677044989493248,@mpjc2 @begonavillacis @BiciMAD @emt @EliteTax...,Jesús García Diaz,955910563820986368,JessGarcaDiaz3
9110,2020-10-23 16:27:02,1319676716349050880,­­📰 El distrito de Latina tiene ya servicio de...,Madrid Actual,50969179,madridactual
9111,2020-10-23 16:19:55,1319674925884592130,RT @diego_rebollo: @begonavillacis @BiciMAD As...,alpedhuez,823491982219689984,alpedhu37343368
9112,2020-10-23 16:19:45,1319674884700672000,RT @CANAL33: El alcalde de Madrid @AlmeidaPP_ ...,Juanchiiii 😅🚇🖥️😍 👬,811543987375013888,deleganada
9113,2020-10-23 16:14:38,1319673595929464832,@begonavillacis @Bogeymaann @BiciMAD Que no se...,paco anaya,215467698,acinrag_
9114,2020-10-23 16:06:46,1319671615530143745,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Pilukipilarapilarin 🎗,906408624574124032,Pilukipilarapi1
9115,2020-10-23 16:05:41,1319671343340789765,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Loli,2149535325,thebestloli
9116,2020-10-23 16:05:14,1319671229243162624,RT @PlataformaEMT: Nosotr@s como trabajador@s ...,Loli,2149535325,thebestloli
9117,2020-10-23 16:01:50,1319670376339832833,@CarajoPS @Afectadosbicim1 @busero_enmascar @b...,Rafael Cacho,3183939634,rafaelcach
9118,2020-10-23 15:59:40,1319669832158162954,@Afectadosbicim1 @CarajoPS @busero_enmascar @b...,Rafael Cacho,3183939634,rafaelcach


### 2.3 Save to csv

- **Check Dataframe shape** 
<br>
        Check the shape of the updated df
<br>
        Count the new tweets (i.e. the difference in rows between the old and the updated df)
- **Save to existing 'rawdata.csv'** 
<br> 
        Overwrite the file with the updated df (old tweets plus the additional ones)
<br>
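
The save cell in this section writes the full updated df back to the file. If only the additional tweets should be written, one option is to append the rows whose `id` is not yet in the CSV. The sketch below is a hedged illustration: the helper `append_new_rows` and the demo file name are hypothetical, and it assumes that an existing file already has a header row containing the `id` column.

```python
import os
import tempfile

import pandas as pd


def append_new_rows(df, csv_path, key='id'):
    """Append rows of df whose key is not yet in csv_path; return the number added."""
    file_exists = os.path.exists(csv_path)
    if file_exists:
        existing_ids = set(pd.read_csv(csv_path, usecols=[key], dtype=str)[key])
    else:
        existing_ids = set()
    new_rows = df[~df[key].astype(str).isin(existing_ids)]
    # Write the header only when the file is created; index=False avoids
    # the extra 'Unnamed: 0' column on the next read
    new_rows.to_csv(csv_path, mode='a', header=not file_exists, index=False)
    return len(new_rows)


# Demo on a temporary file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rawdata_demo.csv')
    first = pd.DataFrame({'id': ['1', '2'], 'text': ['a', 'b']})
    second = pd.DataFrame({'id': ['2', '3'], 'text': ['b', 'c']})
    added_first = append_new_rows(first, path)
    added_second = append_new_rows(second, path)
    saved = pd.read_csv(path, dtype=str)
```

The return value plays the same role as the `df.shape[0] - df_old.shape[0]` check: it reports how many tweets were new in this run.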

In [16]:
# Updated df shape (rows cols)
df.shape

(9156, 6)

In [17]:
# New tweets 
df.shape[0] - df_old.shape[0]

47

In [18]:
# Save to csv - overwrite the existing csv file with the updated dataframe
df.to_csv('../data/raw/rawdata.csv', header=True)