## EPSY 5200: Programming Fundamentals for Social Science Researchers
## Fall 2020 Week 12
### Web Scraping with Python

In the latter half of class, we'll use a Python package called `twint` to scrape data from Twitter (the Twitter API process is laborious, so we'll use this alternative). However, `twint` isn't bundled with Anaconda or available on Anaconda Cloud, so we need to install it ourselves.

The steps below are one typical way to do that (kept here for posterity), but we're going to use the method in the first code cell instead:

Go to https://github.com/twintproject/twint

In Terminal/PowerShell, run:
`conda info --envs`

This lists every `conda` environment and the directory where each is stored, including the default (`base`) environment; an asterisk marks the one that is currently active (note: `conda` is the `pip` alternative used by Anaconda for Python package management).

Let's create a new environment:

`conda create --name epsy`

And activate it (switch environments):

`conda activate epsy`

(Note: in PowerShell, you may need to run `cmd` first to switch to the Command Prompt before `conda activate` will work.)
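
To confirm from inside Python which environment is actually in use, a minimal sketch like the following can help. It assumes that activating an environment sets the `CONDA_DEFAULT_ENV` environment variable (conda normally does this); the helper name `describe_environment` is made up for illustration.

```python
import os
import sys

def describe_environment():
    """Return basic facts about the Python environment running this code."""
    return {
        # Name of the active conda environment, or None if conda did not set it
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Root directory of the environment this interpreter belongs to
        "prefix": sys.prefix,
        # Full path of the Python executable itself
        "executable": sys.executable,
    }

env_info = describe_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

If `conda_env` shows `base` or `None` rather than `epsy`, the session is most likely not running in the new environment.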

Install git:

`conda install git`

Install pip:

`conda install pip`

Now we can use `pip` together with `git` to install `twint` directly from its GitHub repository. On the `twint` page, get the URL you'd use for cloning (https://github.com/twintproject/twint.git) and prefix it with `git+` like so (the older `git+git://` form no longer works, because GitHub has turned off the unauthenticated `git://` protocol):

`pip install git+https://github.com/twintproject/twint.git`


Now if we go to Anaconda Navigator > Environments > epsy, we can search for `twint` and see that it's installed. We'll also use another package, `nest_asyncio`, which `twint` needs in order to run inside a notebook environment like Jupyter.
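
The same check can be done from Python instead of Anaconda Navigator. This is only a sketch: it uses the standard library's `importlib.util.find_spec`, and the helper name `is_installed` is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python, so it should always be found
print("json:", is_installed("json"))

# These depend on the installation steps above having succeeded
for name in ("twint", "nest_asyncio"):
    print(f"{name}:", is_installed(name))
```

A `False` for either package suggests it was installed into a different environment than the one running the notebook, or not installed at all.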

In [None]:
# Unfortunately, twint is not always reliable, and its development can be a bit buggy
# This happens with open-source software, as developers are often volunteers
# In this case, it took me some time to realize that the most recent version has a couple of bugs

# Here is some code I found while reading the GitHub Issues discussion (run the uncommented line)
# Note that this is a *terminal* command, not Python.
# In a Jupyter notebook, starting a line with "!" runs it in the terminal instead of in Python:

#!pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
#!pip3 install --user --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2
!pip install --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2
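
One caveat with `!pip`: the terminal may pick up a different `pip` than the one belonging to the notebook's Python, in which case the package lands somewhere the notebook can't import it. A common workaround, sketched here, is to call `pip` through `sys.executable`. The helper name `pip_command` is hypothetical; the repository URL and branch are the ones from the cell above.

```python
import subprocess
import sys

def pip_command(*args):
    """Build a pip command tied to the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", *args]

install_cmd = pip_command(
    "install",
    "--upgrade",
    "git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2",
)

# Show the command first; nothing is installed until the last line is uncommented
print(" ".join(install_cmd))
# subprocess.run(install_cmd, check=True)
```

Recent versions of Jupyter also provide the `%pip install ...` magic, which serves the same purpose.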

In [None]:
# run these in case you don't have them installed:
!pip install nest_asyncio
!pip install aiohttp_socks

In [None]:
# and let's make sure aiohttp_socks is up to date (upgraded):
!pip install --user --upgrade aiohttp_socks

In [None]:
# now we can import packages we need
import twint
import pandas as pd
import nest_asyncio # lets twint work inside an interactive notebook like Jupyter
nest_asyncio.apply() # patch asyncio so an event loop can run inside the notebook's already-running loop
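
As background on why `nest_asyncio` is needed: Jupyter already runs an `asyncio` event loop, and standard `asyncio` refuses to start running a loop from inside one that is already running, which is presumably what trips up `twint` in a notebook. The sketch below reproduces that refusal using only the standard library (neither `twint` nor `nest_asyncio` is involved), and the function names are made up for illustration. It is meant to be run as a plain Python script rather than in a notebook cell.

```python
import asyncio

async def fetch_number():
    """Stand-in for real asynchronous work."""
    return 42

async def try_nested_run():
    """Attempt to drive a coroutine with the loop that is already running."""
    loop = asyncio.get_running_loop()
    coro = fetch_number()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as err:
        coro.close()  # tidy up the coroutine that never got to run
        return str(err)
    return "no error"

message = asyncio.run(try_nested_run())
print(message)
```

The printed message is the error that `nest_asyncio.apply()` is designed to avoid by allowing nested use of the event loop.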