# Twitter scraper by country
This is a simple project that scrapes up to 1000 tweets from a given country.

### Technologies Used
* Web scraping
* Tweepy API 
* File storage (using csv module)
* Regular expressions (using re module)

### Requirements
* APIs: tweepy 
* Modules: csv, re
* Twitter Developer credentials

In [None]:
%config IPCompleter.greedy=True

import tweepy
import csv
import re

First, we set up our authentication details for accessing Twitter and initialise the Tweepy API. We will need four credentials from a Twitter Developer account:
* a consumer key (API key),
* a consumer secret (API secret),
* an access token, and
* an access token secret.

In [2]:
# Create authentication for accessing Twitter

# TODO: Enter your Twitter Developer credentials
auth = tweepy.OAuthHandler("", "")   # consumer key, consumer secret
auth.set_access_token("", "")        # access token, access token secret

# Initialize Tweepy API
api = tweepy.API(auth)
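
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the four values from environment variables. The variable names and the `load_credentials` function are hypothetical and not part of Tweepy; rename them to suit your setup.

```python
import os

# Hypothetical environment variable names (an assumption, not a Tweepy convention).
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to `tweepy.OAuthHandler(...)` and `auth.set_access_token(...)` in place of the empty strings above.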

Next, we open a .csv file in which to store the tweets from a given country. We then create a csv writer for the file and write a row of column headers. Because the file is opened in append mode, rerunning the notebook adds a new header row and new tweets below any existing data.

In [3]:
# Open/Create a file to append data
csvFile = open('tweets.csv', 'a', newline='')

def create_file(file):
    # Use csv Writer
    csvWriter = csv.writer(file)

    # Write the header row to the .csv file
    csvWriter.writerow(["Date-time created", "Tweet Content", "User Screen Name", "Tweet Place", "Tweet Coordinates", "Tweet Geo Code"])
    
    return csvWriter
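
Since the file is opened in append mode, each run writes another header row. One possible way around this, sketched here under the assumption that a single header per file is wanted, is to write the header only when the file is new or empty. The `open_tweet_writer` helper and the `HEADER` constant are illustrative names, not part of the notebook.

```python
import csv
import os

HEADER = ["Date-time created", "Tweet Content", "User Screen Name",
          "Tweet Place", "Tweet Coordinates", "Tweet Geo Code"]


def open_tweet_writer(path):
    """Open `path` for appending and return (file, writer).

    The header row is written only if the file does not exist yet or is empty.
    """
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, "a", newline="", encoding="utf-8")
    writer = csv.writer(f)
    if needs_header:
        writer.writerow(HEADER)
    return f, writer
```

The caller keeps the file handle so it can be closed once all rows are written.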

The next function takes a country name and a csv writer. It first uses a geo-search to resolve the country to a Twitter place ID, then runs a regular search restricted to that place to fetch the tweets.

In this function, we clean up each tweet in several ways. First, we use the tweet ID and api.get_status() with tweet_mode="extended" to get the full text of each tweet (search results are truncated by default). Second, we replace all newline characters with spaces. Third, we remove emojis using a regular expression that strips every non-ASCII character, so accented letters and non-Latin scripts are removed as well.

We also close the file in this function. 

In [4]:
def find_tweets_from(country, writer):
    # Use a geo-search to get the country in question
    # (Tweepy 3.x names; in Tweepy 4.x these are api.search_geo and api.search_tweets)
    places = api.geo_search(query=country, granularity="country")
    if not places:
        raise ValueError("No place found for country: %s" % country)
    place_id = places[0].id

    # Use a regular search based on the output of the geo-search to get (at most) 1000 tweets
    for tweet in tweepy.Cursor(api.search, q="place:%s" % place_id).items(1000):

        # Fetch the full tweet text rather than the default truncated text
        status = api.get_status(tweet.id, tweet_mode="extended")

        # Replace all the newline characters with spaces
        tweet_converted = status.full_text.replace('\n', ' ')

        # Remove emojis (and any other non-ASCII characters)
        tweet_converted = re.sub(r'[^\x00-\x7F]+', ' ', tweet_converted)

        # Create a row for each filtered tweet.
        # The text is written as str: encoding it to bytes would store "b'...'" in the file.
        writer.writerow([tweet.created_at,
                         tweet_converted,
                         tweet.user.screen_name,
                         tweet.place.name if tweet.place else "Undefined place",
                         tweet.coordinates,
                         tweet.geo])

    csvFile.close()
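
To see what the cleaning steps do without calling the Twitter API, they can be pulled into a small standalone function. This is only a sketch that mirrors the two string operations used above; the `clean_text` name is hypothetical. Note that the pattern removes every non-ASCII character, not only emojis.

```python
import re

# Matches runs of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_text(text):
    """Replace newlines with spaces, then replace each non-ASCII run with one space."""
    text = text.replace("\n", " ")
    return NON_ASCII.sub(" ", text)


print(repr(clean_text("Good morning\nAccra \U0001F600")))  # 'Good morning Accra  '
print(repr(clean_text("café")))                            # 'caf '
```

The second call shows the side effect: the accented letter is dropped along with any emojis, which matters for tweets written in languages other than English.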

Finally, the user provides the name of a country (e.g. USA, Nigeria, Ghana) and the appropriate functions are called. To view the scraped data, navigate to the folder holding this notebook and open tweets.csv, for example in Excel.

In [None]:
country_name = input("Enter country name: ")

# Execute main code block
find_tweets_from(country_name, create_file(csvFile))

print("Scraping complete!")
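
The results can also be inspected without leaving the notebook. The sketch below reads the first few rows back with `csv.DictReader`; the `preview_rows` helper is a hypothetical name, and it assumes the file starts with the header row written earlier. If the notebook has been run more than once, later header rows will appear as ordinary data rows.

```python
import csv
from itertools import islice


def preview_rows(path, limit=5, encoding=None):
    """Return up to `limit` data rows as dicts keyed by the header row.

    `encoding=None` uses the platform default, matching the notebook's open() call.
    """
    with open(path, newline="", encoding=encoding) as f:
        return list(islice(csv.DictReader(f), limit))
```

For example, `preview_rows("tweets.csv")` returns a list of dictionaries, and `row["Tweet Content"]` gives the cleaned text of each tweet.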

## References
* http://docs.tweepy.org/en/latest/cursor_tutorial.html
* http://docs.tweepy.org/en/latest/extended_tweets.html
* http://docs.tweepy.org/en/latest/api.html?highlight=place#API.geo_id
* https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf
* https://stackoverflow.com/questions/17633378/how-can-we-get-tweets-from-specific-country
* https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row