# Python Series Part 2: Understanding The Python Package Ecosystem

Welcome to part 2 of our Python workshop series! In this workshop, we'll be covering one of the most useful parts of Python: its extensive ecosystem of helpful *packages*. These packages are an incredibly valuable resource to researchers, as they can perform a lot of the common tasks you'll frequently encounter when doing data science.

This guided tutorial is presented as an interactive Jupyter notebook, and you are invited to follow along by running the code in your own notebook.

Learning objectives:
*   Understand what packages are and how to use them
*   Understand the process of installing new packages through PyPI and `pip`
*   Familiarize yourself with the list of commonly used data science packages, and know which package to use for each specific task you might need in your own research
*   Get hands-on experience with `numpy`, the package that forms the basis of most of the Python data science ecosystem



## 1. What are packages?

A *package* is a collection of premade functions designed to help with specific tasks. For instance, in the social sciences we often need to generate random numbers (say, when running a simulation), and there is a package called `random` (https://docs.python.org/3.8/library/random.html) that provides functions for doing this.

To use the functions in a package, we first need to tell Python that we want to use the package. To do this, we use the `import` syntax:

In [None]:
import random

Once we have imported the package, we can access its functions using the "." syntax, the same syntax used to access methods.

Let's explore this by playing with some of the functions from the `random` package. We'll look at two functions:
*   `randint(a, b)`: generates a random integer that is greater than or equal to `a` and less than or equal to `b` (both endpoints are included)
*   `choice(l)`: picks a random value from the list `l`

In [None]:
rand_int1 = random.randint(0, 10) # create and print 3 random integers
rand_int2 = random.randint(0, 10)
rand_int3 = random.randint(20, 100)
print(rand_int1, rand_int2, rand_int3)

6 9 58
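
Because the numbers are random, your output will differ from the one shown here. If you need a simulation to give the same numbers every time, one option is to fix the *seed* of the random number generator. The sketch below is not part of the original workshop code: it assumes you want repeatable results, and it also checks that both endpoints of `randint` can be drawn.

```python
import random

random.seed(42)  # fixing the seed makes the "random" sequence repeatable
first_run = [random.randint(0, 10) for _ in range(5)]

random.seed(42)  # same seed again ...
second_run = [random.randint(0, 10) for _ in range(5)]
print(first_run == second_run)  # True: the same five numbers were generated

# Draw many integers between 1 and 3 and list the distinct values we saw
draws = [random.randint(1, 3) for _ in range(1000)]
print(sorted(set(draws)))  # both endpoints, 1 and 3, show up
```

The seed value 42 is arbitrary; any integer works, as long as you use the same one each time.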


In [None]:
names = ['Jonathan', 'Samantha', 'Xiaomeng', 'Aishat', 'Florio', 'Jacob']
print(random.choice(names)) # pick a random name

Jonathan


If you know that you only need a few specific functions from a package, you can import just those functions instead of importing the whole package. This is convenient because you can then use those functions without the "." syntax. To import specific functions, use the syntax `from [package name] import [function]`:

In [None]:
from random import randint, choice
# now we can use randint and choice without having to type "random." first
print(randint(0, 10))
print(choice(names))

5
Jacob


## 2. Installing packages: PyPI and pip

Some packages, such as `random`, are automatically installed for you when you first install Python. But what if we want more packages? To get more packages, we will need to install them ourselves.

Python packages are hosted online on the Python Package Index (PyPI): https://pypi.org/. Feel free to take a look around and browse the packages that are available!

To install a package, use `pip`:

In [None]:
!pip install python-twitter

Collecting python-twitter
  Downloading python_twitter-3.5-py2.py3-none-any.whl (67 kB)
Installing collected packages: python-twitter
Successfully installed python-twitter-3.5
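
If you want to double-check from inside Python that an install worked, one option is to ask for the installed version. This is a minimal sketch, assuming Python 3.8 or newer (where the standard library includes `importlib.metadata`); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install above succeeded, otherwise None
print(installed_version("python-twitter"))
# A name that is (presumably) not installed on your machine gives None
print(installed_version("some-package-you-never-installed"))
```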


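When installing a package, the name you type has to match the official package name **exactly**. `pip` ignores capitalization and treats `-`, `_`, and `.` as interchangeable, but any other difference means it will look for a different package, and the installation will fail if no package by that name exists. If you are unsure of the official name of the package, check its page on the PyPI website.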
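
To make "matching the name" concrete: PyPI compares package names after a small normalization step (lowercasing, and treating runs of `-`, `_`, and `.` as a single `-`). The sketch below reproduces that rule with a helper we have named `normalize` for illustration; it is not a function provided by `pip`.

```python
import re


def normalize(name):
    """Normalize a package name the way PyPI does before comparing names."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize("Python_Twitter") == normalize("python-twitter"))  # True
print(normalize("pythontwitter") == normalize("python-twitter"))   # False
```

So capitalization and the choice of separator do not matter, but leaving the separator out entirely gives a different name.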

In [None]:
# this will fail because the official name of the package is python-twitter, not pythontwitter
!pip install pythontwitter

ERROR: Could not find a version that satisfies the requirement pythontwitter (from versions: none)
ERROR: No matching distribution found for pythontwitter


### Exercise 2.1

Cornell publishes a Python package for analyzing conversational data, called the "Cornell Conversational Analysis Toolkit." Search for this package on the PyPI website to find out what its exact name is, and then install it with `pip`.

In [None]:
# fill in the blank with the exact name you found on the PyPI website
!pip install __________

## 3. Packages for data science

Python is very commonly used for data science, and there is a large family of data science packages designed to work well together. If you are using Python in social science research, chances are you will end up wanting to use at least one of these packages!

Here is a list that covers some common data science tasks you might need to do in social science research, and the Python packages that support them:
*   Processing numerical data: `numpy`
*   Advanced statistics: `scipy`, `statsmodels`
*   Data visualization: `matplotlib`
*   Tabular data management (think spreadsheets): `pandas`
*   Network analysis: `networkx`
*   Machine learning: `scikit-learn`
*   Deep learning: `pytorch`, `tensorflow`
*   Natural language processing: `nltk`
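
One thing to watch out for: the name you give to `pip install` is not always the name you `import`. For example, `scikit-learn` is imported as `sklearn`, and PyTorch is installed and imported as `torch`. The sketch below uses the standard library to check which of these packages are already available on your machine; the dictionary is just our own lookup table for this example.

```python
from importlib.util import find_spec

# Name used with "pip install" -> name used with "import" (they sometimes differ)
import_names = {
    "numpy": "numpy",
    "scipy": "scipy",
    "statsmodels": "statsmodels",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "networkx": "networkx",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "tensorflow": "tensorflow",
    "nltk": "nltk",
}

# find_spec returns None when Python cannot find a package with that import name
status = {pip_name: find_spec(import_name) is not None
          for pip_name, import_name in import_names.items()}

for pip_name, installed in status.items():
    print(f"{pip_name:<14} import as {import_names[pip_name]:<12} installed: {installed}")
```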

### Exercise 3.1

As a researcher using Python, an important skill is knowing which packages to use for any specific task you find yourself doing! Let's practice this now. For each of the scenarios below, try to identify which data science package (from the list above) you would need. Some scenarios might need more than one package!

1. Modeling the spread of misinformation through a social network
2. Creating plots for a research paper
3. Analyzing a collection of tweets
4. Loading numerical measurements from a field experiment and running statistical comparisons on them
5. Training a machine learning model on a public dataset provided in the form of a spreadsheet

## 4. A quick look at numpy

We previously mentioned that the many packages in the Python data science ecosystem are designed to work well with each other. One reason they are able to do this is that many of them are built on a common foundation: the `numpy` package. Because `numpy` plays such a key role in the data science ecosystem, it is a good idea to get familiar with it, as you will very likely end up having to use it either directly or indirectly.

Let's start by installing and importing `numpy`.

In [None]:
!pip install numpy



In [None]:
import numpy
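
A side note on naming: in other tutorials and in the official documentation you will very often see `numpy` imported under the short alias `np`. This workshop spells out the full name `numpy`, but the two styles refer to the same package, as this small sketch shows.

```python
import numpy
import numpy as np  # "np" is the conventional short alias for numpy

print(np is numpy)  # True: both names refer to the same package
print(np.asarray([1, 2, 3]).sum() == numpy.asarray([1, 2, 3]).sum())  # True
```

If you copy code that uses `np.` into your own notebook, make sure your import line uses the `as np` form, otherwise Python will not know what `np` means.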

`numpy` provides a special type of object known as an `array`. Arrays are similar to lists, but they are optimized for numerical data. Other data science packages often rely on `numpy` arrays to store data, and then provide additional functions to do specific kinds of data science on that data.

Let's get a better understanding of how `numpy` arrays work and how they differ from regular lists. First, we can create an array out of a list of numbers using the `asarray` function:

In [None]:
demo_array = numpy.asarray([10, 47, 1, 0, 100, 999])
print(type(demo_array))

<class 'numpy.ndarray'>


The usual list indexing and slicing operations also work on arrays:

In [None]:
print(demo_array[1])
print(demo_array[-1])
print(demo_array[2:5])
print(demo_array[:2])
print(demo_array[5:])

47
999
[  1   0 100]
[10 47]
[999]
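
Where arrays and lists start to behave differently is arithmetic. With a list, `*` repeats the list; with an array, arithmetic is applied to every element. Here is a small sketch using made-up numbers:

```python
import numpy

plain_list = [10, 47, 1]
as_array = numpy.asarray(plain_list)

print(plain_list * 2)  # [10, 47, 1, 10, 47, 1]  (the list is repeated)
print(as_array * 2)    # [20 94  2]              (each element is doubled)
print(as_array + 5)    # [15 52  6]              (5 is added to each element)
```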


`numpy` also supports *advanced indexing*, which lets you filter array items based on numerical criteria. This is best understood through an example. Suppose you have a dataset containing people's ages, and you want to keep only the individuals above a certain age. Using regular lists, recall that we would do this using a combination of a for loop and a conditional:

In [None]:
ages = [20, 43, 12, 88, 97]
filtered = []
for age in ages:
    if age > 50: # we want to select only the elders
        filtered.append(age)
print(filtered) # filtered only contains the two elders, ages 88 and 97

[88, 97]


But with `numpy` arrays, the same thing can be done in one line of code using the filtering syntax (note: for those familiar with R, the syntax is similar):

In [None]:
ages = numpy.asarray(ages) # first we need to convert the list to an array
filtered = ages[ages > 50]
print(filtered) # we get the same result!

[88 97]
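
To see why this works, it helps to look at the comparison on its own. Comparing an array to a number produces a new array of `True`/`False` values (often called a *mask*), and indexing with that mask keeps only the positions marked `True`. The variable name `mask` below is just our choice for this sketch.

```python
import numpy

ages = numpy.asarray([20, 43, 12, 88, 97])

mask = ages > 50   # element-wise comparison: one True/False per age
print(mask)        # [False False False  True  True]
print(ages[mask])  # [88 97]
print(mask.sum())  # 2, since True counts as 1 when summing
```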


Filters can also be combined using the `&` (and) and `|` (or) operators, as long as each comparison is wrapped in its own parentheses. For example, if we only want to select "middle age" individuals (older than 20 but younger than 50), we can do this by combining two comparisons:

In [None]:
middle_age_only = ages[(ages > 20)&(ages < 50)]
print(middle_age_only)

[43]
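
Note the parentheses around each comparison: they are not optional. In Python, `&` is evaluated before `>` and `<`, so leaving the parentheses out changes the meaning of the expression and, for arrays, leads to an error. The sketch below shows both versions side by side (the `raised` flag is only there to record what happened).

```python
import numpy

ages = numpy.asarray([20, 43, 12, 88, 97])

# With parentheses: each comparison is evaluated first, then combined
print(ages[(ages > 20) & (ages < 50)])  # [43]

# Without parentheses, Python reads this as: ages > (20 & ages) < 50
raised = False
try:
    ages[ages > 20 & ages < 50]
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

For the same reason, Python's `and` and `or` keywords cannot be used to combine array filters; use `&` and `|` instead.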


What if we want the opposite filter: selecting all individuals **except** for middle age ones? This means we want individuals who are *either* 20 or younger *or* 50 or older. Because this is an "or" comparison, we use the `|` operator:

In [None]:
no_middle_age = ages[(ages <= 20)|(ages >= 50)]
print(no_middle_age)

[20 12 88 97]
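
Another way to get an "everything except" filter is to build the original filter and then flip it with the `~` (not) operator. This sketch checks that both approaches select the same ages:

```python
import numpy

ages = numpy.asarray([20, 43, 12, 88, 97])

middle = (ages > 20) & (ages < 50)  # True only for the middle age group
print(ages[~middle])                # [20 12 88 97]

# Same selection as writing out the two "or" conditions by hand
by_hand = ages[(ages <= 20) | (ages >= 50)]
print(numpy.array_equal(ages[~middle], by_hand))  # True
```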


### Exercise 4.1

In the code cell below, the `years` array contains publication years for a dataset of news articles, ranging from 1800 to 2020. Write filters for each of the following:
1. Articles from before the 21st century
2. Articles written in the year 2000
3. Articles from the 20th century only

In [None]:
years = numpy.asarray([1812, 1905, 1856, 2020, 2001, 1984, 1945, 1890, 2001, 2000, 1905, 2016, 2016, 1904, 1900, 2000, 2001, 1936, 2008, 2001, 1888, 1921, 1995, 2014])
before21st = __________       # fill in your answer to problem 1 here
year2k = __________           # fill in your answer to problem 2 here
twentiethcentury = __________ # fill in your answer to problem 3 here
print(before21st)
print(year2k)
print(twentiethcentury)

Arrays also have methods implementing simple statistical summaries, such as mean and standard deviation (more advanced statistics require the use of a separate package such as `scipy`):

In [None]:
print(ages.mean()) # compute the mean of an array using array.mean
print(ages.std()) # compute the standard deviation of an array using array.std

52.0
34.71599055190562
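
A detail worth knowing if you compare these numbers against other software: by default, `std` computes the *population* standard deviation (dividing by n). Many statistics tools report the *sample* standard deviation instead (dividing by n - 1), which you can request with the `ddof=1` argument. The sketch below compares both against Python's built-in `statistics` module.

```python
import statistics

import numpy

age_list = [20, 43, 12, 88, 97]
ages = numpy.asarray(age_list)

print(ages.std())                     # population standard deviation (divides by n)
print(statistics.pstdev(age_list))    # same quantity from the standard library

print(ages.std(ddof=1))               # sample standard deviation (divides by n - 1)
print(statistics.stdev(age_list))     # same quantity from the standard library
```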


So far, we have looked only at one-dimensional arrays, which operate similarly to lists. But arrays can also be multi-dimensional! In particular, two-dimensional arrays are useful to represent matrices. A 2D array can be created from a list of lists, where each nested list represents one row of the matrix, ordered from top to bottom. Thus the following code creates a matrix with 3 rows and 5 columns:

In [None]:
matrix3x5 = numpy.asarray([[1, 0, 4, 5, 7], [10, 0, 0, 1, 2], [5, 2, 4, 8, 9]])
print(matrix3x5)

[[ 1  0  4  5  7]
 [10  0  0  1  2]
 [ 5  2  4  8  9]]


To index a specific item from a 2D array, you have to specify the row followed by the column, in that order, separated by a comma:

In [None]:
print(matrix3x5[0,3]) # grabs the element in the first row and fourth column
print(matrix3x5[-1,1]) # negative indexing still works, this gets the element in the last row and second column

5
2
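
If you are ever unsure how many rows and columns an array has, you can ask it directly using its `shape` attribute. A short sketch, reusing the same matrix:

```python
import numpy

matrix3x5 = numpy.asarray([[1, 0, 4, 5, 7], [10, 0, 0, 1, 2], [5, 2, 4, 8, 9]])

print(matrix3x5.shape)  # (3, 5): 3 rows, 5 columns
print(matrix3x5.ndim)   # 2: the number of dimensions

# The comma syntax gives the same item as indexing the row first, then the column
print(matrix3x5[0, 3] == matrix3x5[0][3])  # True
```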


There are more advanced things you can do, such as creating arrays with more than 2 dimensions and slicing multi-dimensional arrays, but those will not be covered here since this is not a `numpy` workshop.

## 5. Final Exercise

Let's try combining everything we have learned. This exercise represents a task you might commonly encounter in social science research, which can be performed using functions from certain packages you need to install.

### The Problem

We are working with a dataset of shopping activity at two competing grocery stores. This dataset specifically measures the amount of money spent, rounded to the nearest dollar, by shoppers at the two stores. The measurements have been represented as two lists, one for each store:

In [None]:
store1 = [33, 32, 55, 44, 53, 19, 50, 34, 37, 41, 81, 65, 50, 18, 57, 31, 49,
          33, 43, 52, 41, 35, 21, 26, 28, 48, 48, 51, 36, 32, 40, 43, 51, 49,
          38, 20, 31, 43, 50, 53,  3, 48, 63, 31, 58, 55, 45, 13, 56, 30]
store2 = [43, 40, 49, 59, 62, 47, 49, 86, 45, 52, 39, 68, 68, 35, 78, 79, 75,
          57, 76, 40, 60, 67, 68, 46, 64, 52, 59, 58, 70, 61, 85, 81, 65, 53,
          60, 74, 42, 56, 45, 59, 22, 50, 42, 84, 58, 58, 76, 41, 49, 37]

We want to find out whether the average amount of money spent per shopper differs between the two stores.

### Task 1

We will begin by simply computing the mean amount of money spent per shopper at each of the two stores. To do this, first convert the two lists into `numpy` arrays, and use the `mean` method.

In [None]:
# Your solution to Task 1 goes here


### Task 2

Hopefully, you found in your answer to Task 1 that the means appear to be different. But in social science we typically want to be a bit more careful: sometimes, differences in means can just be due to noise in the data. We therefore need to know whether the difference is statistically significant. A common way of doing this is through the t-test. The `scipy` package contains a function, `scipy.stats.ttest_ind`, that implements the two-sample t-test. Install `scipy` and use the `scipy.stats.ttest_ind` function to run the t-test on our data.

HINT: The weird syntax in `scipy.stats.ttest_ind`, with the two "."s, is because the `scipy` package is actually split into multiple "sub-packages", one of which is called `stats`. Your `import` line will need to reflect this: import the `scipy.stats` sub-package specifically, since just importing `scipy` is not guaranteed to make `scipy.stats` available (it does not in older versions of `scipy`).
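
If the sub-package syntax is new to you, here is what the same pattern looks like with a package that ships with Python, so as not to give away the answer. This sketch uses `urllib.parse`, which has nothing to do with statistics; it is only meant to show the shape of the `import` line and of the function call.

```python
import urllib.parse  # import the "parse" part of the "urllib" package explicitly

# The function is then reached through both names, separated by "."s
parts = urllib.parse.urlparse("https://pypi.org/project/numpy/")
print(parts.netloc)  # pypi.org
print(parts.path)    # /project/numpy/
```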

In [None]:
# First, install scipy using pip in this code cell


In [None]:
# Then put the rest of your solution to Task 2 here.