<img src='images/gesis.png' style='height: 60px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>
<img src='images/isi.png' style='height: 50px; float: left; margin-left: 20px'>  

Authors: Haiko Lietz, N. Gizem Bacaksizlar Turbic and Pouria Mirelmi

### 0. Initial notes by Haiko & Gizem:

IN GENERAL: INTRODUCE THINGS HERE WHICH WE NEED IN SESSIONS 3-14. THAT MEANS, WHENEVER WE 

#### Textbooks & sources

- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://www.pythonlikeyoumeanit.com/index.html

#### Notes
- Focus here is on flat data structures (Pandas dataframes) and mathematical data structures (NumPy arrays), add also regex here instead of showing them in the functions of NLP?; hierarchical data structures (JSON and HTML) are covered in session 4.

#### Essentials

https://www.pythonlikeyoumeanit.com/module_2.html

BOX: OBJECT-ORIENTED PROGRAMMING

https://www.pythonlikeyoumeanit.com/module_4.html

# 1. Introduction



In this notebook, you will learn about some of the most important libraries and techniques for handling and visualizing data in Python. We will walk you through the basic methods of libraries such as NumPy, Pandas, and SciPy. What you learn here is fundamental to the other notebooks.

## 1.1 Reading/writing text files

In the first notebook, we went through some basics of Python. In this section, you will learn more about handling files, with a focus on reading text data for further analysis.

Let's say you want to create a text file and write something to it. You can do that with the following lines of code:

In [6]:
my_file = open("test.txt", "w")
my_file.write("Some text")
my_file.close()

The first line opens (and, if necessary, creates) a text file named *test.txt*. The second argument of `open` ("w") tells Python that we want to *write* to the file; note that this mode overwrites any existing content.
In the second line, we write "Some text" to the file. In the third line, we close the file, which makes sure that what we have written is saved.

If we want to read the text in the file, we can use the same three lines with minor changes:

In [10]:
my_file = open("test.txt", "r")
text = my_file.read()
my_file.close()

In [15]:
print (text)

Some text


By changing the second argument of `open` to "r", we tell Python that we want to *read* what is in the file. In the second line, we store what we have read in the `text` variable, and then we close the file.

A better way to do the above is the `with` statement, which closes the file automatically when the indented block ends, even if an error occurs inside it.

For writing to a file:

In [16]:
with open("test.txt", "w") as my_file:
    my_file.write("Some new text")

For reading the file:

In [19]:
with open("test.txt", "r") as my_file:
    text = my_file.read()
    
print (text)

Some new text
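As a small extension of the examples above, the following sketch appends to a file with mode "a" and reads it back line by line. The file name `lines.txt` and the temporary directory are assumptions made for illustration, so that nothing in your working directory is overwritten; `encoding="utf-8"` is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory (illustration only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "lines.txt")

# "w" creates the file (or overwrites an existing one).
with open(path, "w", encoding="utf-8") as my_file:
    my_file.write("first line\n")
    my_file.write("second line\n")

# "a" appends to the end of the file instead of overwriting it.
with open(path, "a", encoding="utf-8") as my_file:
    my_file.write("third line\n")

# Iterating over a file object yields one line at a time.
with open(path, "r", encoding="utf-8") as my_file:
    lines = [line.rstrip("\n") for line in my_file]

print(lines)
```

Reading line by line is useful for large files, because the whole content never has to be held in memory at once.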


## 1.2 Reading text from PDF files


<img src='images/pdf.png' style='height: 120px; float: right; margin-left: 50px' >

To analyse text data, we often need to read it from PDF files. An efficient way to do that is the **PyPDF2** library, a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It has many other useful features, but our emphasis here is on reading data from PDF files. For more information on its capabilities, take a look at [its documentation](https://pypdf2.readthedocs.io/en/latest/).

You can install the library with `pip`:

In [20]:
pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


### 1.2.1 Reading Metadata

We'll begin by reading the metadata of a PDF file. As an example, we use the paper *Generative Adversarial Networks* by Ian Goodfellow et al. You can use any other file of your choice.

In [21]:
from PyPDF2 import PdfReader

In [22]:
reader = PdfReader("GAN.pdf")

In [23]:
meta = reader.metadata
meta

{'/CreationDate': "D:20201006125850-04'00'",
 '/Creator': 'Adobe InDesign 15.1 (Macintosh)',
 '/ModDate': "D:20201006125856-04'00'",
 '/Producer': 'Adobe PDF Library 15.0'}

In [24]:
print(meta.author)
print(meta.creator)
print(meta.producer)
print(meta.subject)
print(meta.title)

None
Adobe InDesign 15.1 (Macintosh)
Adobe PDF Library 15.0
None
None


Please note that any of the above values can be `None` for your own PDF files. You can also write metadata to your PDF files; [here](https://pypdf2.readthedocs.io/en/latest/user/metadata.html) is how.

### 1.2.2 Extracting Text from a PDF

For accessing the pages of the pdf file, you can do the following:

*Note: the `reader` variable is defined in the previous section.*

In [25]:
pages = reader.pages
len(pages)

6

In [26]:
print(pages[0].extract_text())

NOVEMBER 2020  |  VOL. 63  |  NO. 11  |   COMMUNICATIONS OF THE ACM    139Generative Adversarial Networks
By Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,  
David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua BengioDOI:10.1145/3422622
Abstract
Generative adversarial networks are a kind of artificial intel-
ligence algorithm designed to solve the generative model-
ing problem. The goal of a generative model is to study a 
collection of training examples and learn the probability 
distribution that generated them. Generative Adversarial 
Networks (GANs) are then able to generate more examples 
from the estimated probability distribution. Generative 
models based on deep learning are common, but GANs 
are among the most successful generative models (espe-
cially in terms of their ability to generate realistic high-
resolution images). GANs have been successfully applied 
to a wide variety of tasks (mostly in research settings) but 
continue to present unique challen

More information on the `extract_text()` method can be found [here](https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extract_text).

#### Using a visitor


You can use *visitor-functions* to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.

The function provided in the `visitor_text` argument of the `extract_text()` method takes five arguments: the text, the current transformation matrix, the text matrix, the font dictionary and the font size. In most cases the x and y coordinates of the current position are in index 4 and 5 of the current transformation matrix.

The font-dictionary may be None in case of unknown fonts. If not None it may e.g. contain key “/BaseFont” with value “/Arial,Bold”.

Warning: In complicated documents the calculated positions might be wrong.

The function provided in the `visitor_operand_before` argument takes four arguments: the operator, the operand arguments, the current transformation matrix and the text matrix.

#### Example: Ignoring header and footer

In this example, we read the text of page 1 (index 0) and keep only the text fragments with 30 < y < 670. Since y is measured from the bottom of the page, this drops the header (y ≥ 670) and the footer (y ≤ 30).

In [28]:
page = reader.pages[0]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 30 and y < 670:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)


Abstract
Generative adversarial networks are a kind of artificial intel-
ligence algorithm designed to solve the generative model-
ing problem. The goal of a generative model is to study a 
collection of training examples and learn the probability 
distribution that generated them. Generative Adversarial 
Networks (GANs) are then able to generate more examples 
from the estimated probability distribution. Generative 
models based on deep learning are common, but GANs 
are among the most successful generative models (espe-
cially in terms of their ability to generate realistic high-
resolution images). GANs have been successfully applied 
to a wide variety of tasks (mostly in research settings) but 
continue to present unique challenges and research 
opportunities because they are based on game theory 
while most other approaches to generative modeling are 
based on optimization.
1. INTRODUCTION
Most current approaches to developing artificial intelli-
gence are based primarily on mach

### 1.2.3 Extracting Images

Every page of a PDF document can contain an arbitrary number of images. The names of the files may not be unique.
The following piece of code extracts the images of the second page of the pdf file and saves them in the current directory.

In [29]:
page = reader.pages[1]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1

### 1.2.4 Reading PDF Annotations

We can also access the annotations that may be present in a PDF file. The *GAN.pdf* file contains three annotations: the first page has a sticky note with some text and a highlighted passage that also carries a note, and the second page has another sticky note. We can access their text as follows:

#### Sticky notes:

In [30]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Text":
                print(annot.get_object()["/Contents"])

This is the note in the first page.
This is the note in the second page.


#### Highlighted text:

In [31]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Highlight":
                print(annot.get_object()["/Contents"])

This is the note on the highlighted text in page one.


## 1.3 Unicode and emoji handling  [<a href='#destination1'>1, 2</a>] <a id='destination1_'></a>

Today’s programs need to be able to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python’s string type uses the **Unicode** Standard for representing characters, which lets Python programs work with all these different possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.

To handle all kinds of text data, including characters from different languages and emoji, we need to be familiar with this standard.

<img src='images/emojis.png' style='height: 200px; float: left; align="center" ;margin-left: 115px'>

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, the [actual number assigned](https://www.unicode.org/versions/Unicode15.0.0/#Summary) is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).
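The relation between a character and its code point can be checked with Python's built-in `chr()` and `ord()` functions. The following minimal sketch uses the code point U+265E from the paragraph above; the variable names are chosen for illustration only.

```python
import unicodedata

knight = chr(0x265E)          # build the character from its code point
code_point = ord(knight)      # recover the integer code point again
notation = f"U+{code_point:04X}"

print(knight)
print(code_point)                # 9822
print(notation)                  # U+265E
print(unicodedata.name(knight))  # BLACK CHESS KNIGHT
```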

Like every character, each of the emoji below has its own code point. To write such a code point in a Python string, replace the **U+** prefix with **\U** and pad the hexadecimal number with leading zeros to exactly eight digits.

For example, **U+1F605** is written as **\U0001F605**: **+** is replaced with **000**, and **\** is put in front.

Here are some examples:

In [18]:
print("grinning face: \U0001F600")
print("beaming face with smiling eyes: \U0001F601")
print("grinning face with sweat: \U0001F605")
print("rolling on the floor laughing: \U0001F923")
print("face with tears of joy: \U0001F602")
print("slightly smiling face: \U0001F642")
print("smiling face with halo: \U0001F607")
print("smiling face with heart-eyes: \U0001F60D")
print("zipper-mouth face: \U0001F910")
print("unamused face: \U0001F612")

grinning face: 😀
beaming face with smiling eyes: 😁
grinning face with sweat: 😅
rolling on the floor laughing: 🤣
face with tears of joy: 😂
slightly smiling face: 🙂
smiling face with halo: 😇
smiling face with heart-eyes: 😍
zipper-mouth face: 🤐
unamused face: 😒
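The conversion between the `U+` notation and a Python string can also be done programmatically. This is a minimal sketch; `char_from_notation` is a hypothetical helper written for this notebook, not part of any library.

```python
def char_from_notation(notation):
    """Turn a string such as 'U+1F605' into the character it denotes."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected something like 'U+1F605', got {notation!r}")
    # The part after 'U+' is the code point in hexadecimal.
    return chr(int(notation[2:], 16))


sweat = char_from_notation("U+1F605")
print(sweat == "\U0001F605")  # True

# The reverse direction: format a code point as an eight-digit \U escape.
escape = f"\\U{ord(sweat):08X}"
print(escape)  # \U0001F605
```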


### 1.3.1 Extracting all emojis from the text

You can easily extract all the emojis from the text using Python. It can be done using regular expressions. First, you need to install *regex* using `pip`:

In [19]:
pip install regex

Note: you may need to restart the kernel to use updated packages.


In [22]:
import regex

You can use the `regex.findall()` function to find all the emojis in the text:

In [39]:
text = 'We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string'

emojis = regex.findall(r'[^\w\s,.]', text)

In [40]:
emojis

['😍', '😇', '😅', '😁', '😀', '😒', '😁', '😂']
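Once the emojis are extracted, counting them is a common next step. The sketch below uses `collections.Counter` and, as an assumption to keep it free of extra installs, Python's built-in `re` module instead of `regex`; for this simple pattern both behave the same. Note that the pattern matches every character that is not a word character, whitespace, comma or period, so other punctuation such as "!" would be counted as well.

```python
import re
from collections import Counter

text = 'We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string'

# Same character class as above: anything that is not a word character,
# whitespace, comma or period.
emojis = re.findall(r'[^\w\s,.]', text)

emoji_counts = Counter(emojis)
print(emoji_counts.most_common(1))  # [('😁', 2)]
```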

### 1.3.2 Removing emoji from the text in python

You can remove all emojis from the text with the help of regular expressions in Python:

In [41]:
import regex

text = 'We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string'

print(text)

We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string


In [43]:
# Function to remove emoji from text.
# The ranges below cover common emoji blocks but are not exhaustive.

def removeEmoji(text):
    emoji_pattern = regex.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=regex.UNICODE,
    )
    return emoji_pattern.sub('', text)

In [44]:
removeEmoji(text)

'We  want  to  extract  every  emoji  in  this  string'

# 2. Numpy  [<a href='#destination2'>3</a>] <a id='destination2_'></a>

<img src='images/numpy.png' style='height: 120px; float: right; margin-left: 40px' >


NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. Numpy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The Numpy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The Numpy library contains multidimensional array and matrix data structures (you’ll find out more about this later in this notebook). It provides **ndarray**, a homogeneous n-dimensional array object, with methods to efficiently operate on it. Numpy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

## 2.1 Installation

In order to install Numpy, you can simply use `conda` or `pip`:

In [None]:
conda install numpy

# pip install numpy

Then, like any other library you need to import it first:

In [17]:
import numpy as np

## 2.2 Numpy arrays

The array is the central data structure of the NumPy library. An array is a grid of values, and it contains information about the raw data, how to locate an element, and how to interpret an element. Arrays look very much like Python lists, but they are faster and more compact.

One way we can initialize Numpy arrays is from python lists, using nested lists for two- or higher-dimensional data.

For example:

In [55]:
a = np.array([1, 2, 3, 4, 5, 6])

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

We can access the elements of an array using square brackets, just like with lists. Indexing in NumPy also starts at 0.

In [56]:
print(a[0])

1


Besides creating an array from a sequence of elements, you can easily create an array filled with 0’s or 1's:

In [58]:
np.zeros(2)

array([0., 0.])

In [59]:
np.ones(2)

array([1., 1.])

You can also create an array with a range of elements:

In [60]:
np.arange(4)

array([0, 1, 2, 3])
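The same creation functions also accept a shape tuple or a start, stop and step. A short sketch with illustrative variable names, which also shows `np.linspace` for evenly spaced values:

```python
import numpy as np

grid = np.zeros((2, 3))              # 2 rows and 3 columns filled with 0.0
evens = np.arange(2, 9, 2)           # start, stop (exclusive), step
points = np.linspace(0, 10, num=5)   # 5 evenly spaced values, both ends included

print(grid.shape)  # (2, 3)
print(evens)       # [2 4 6 8]
print(points)
```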

### 2.2.1 Concatenating and sorting elements


**Sorting** the elements of an array is simple with `np.sort()`. You can specify the axis, kind, and order when you call the function. Take the array `arr`, for example:

In [63]:
arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

You can quickly sort the numbers in ascending order with:

In [64]:
np.sort(arr)

array([1, 2, 3, 4, 5, 6, 7, 8])

**Concatenating** two arrays `a` and `b` could be done like this:

In [65]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

np.concatenate((a, b))

array([1, 2, 3, 4, 5, 6, 7, 8])

Or, if you start with these arrays:

In [66]:
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

You can concatenate them with:

In [67]:
np.concatenate((x, y), axis=0)

array([[1, 2],
       [3, 4],
       [5, 6]])
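A few variations on sorting and concatenating, as a minimal sketch: sorting in descending order by reversing the sorted result, getting the sorting order with `np.argsort`, and concatenating along `axis=1`. The array `z` is introduced here for illustration only.

```python
import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

descending = np.sort(arr)[::-1]   # sort ascending, then reverse
order = np.argsort(arr)           # indices that would sort arr

x = np.array([[1, 2], [3, 4]])
z = np.array([[5], [6]])          # one column with the same number of rows as x
side_by_side = np.concatenate((x, z), axis=1)

print(descending)    # [8 7 6 5 4 3 2 1]
print(order)         # [1 0 3 5 2 6 4 7]
print(side_by_side)
```

Along `axis=1`, the arrays must have the same number of rows; along `axis=0`, the same number of columns.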

### 2.2.2 The shape and size of an array


`ndarray.ndim` will tell you the number of axes, or dimensions, of the array.

`ndarray.size` will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.

`ndarray.shape` will display a tuple of integers that indicate the number of elements stored along each dimension of the array. If, for example, you have a 2-D array with 2 rows and 3 columns, the shape of your array is (2, 3).

For example, if you create this array:

In [68]:
array_example = np.array([[[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0 ,1 ,2, 3],
                           [4, 5, 6, 7]]])

You will have:

In [75]:
print(f'array_example.ndim: {array_example.ndim}')
print(f'array_example.size: {array_example.size}')
print(f'array_example.shape: {array_example.shape}')

array_example.ndim: 3
array_example.size: 24
array_example.shape: (3, 2, 4)


#### Reshaping arrays:

Using `arr.reshape()` will give a new shape to an array without changing the data. Just remember that when you use the reshape method, the array you want to produce needs to have the same number of elements as the original array.

For example:

In [83]:
a = np.arange(6)
print(f'a:\n {a}')

reshaped_a = a.reshape(3, 2)
print (f'\nreshaped_a:\n {reshaped_a}')

a:
 [0 1 2 3 4 5]

reshaped_a:
 [[0 1]
 [2 3]
 [4 5]]
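Two details of `reshape` are worth a quick sketch: one dimension can be given as `-1`, in which case NumPy infers it from the number of elements, and a shape that does not match the number of elements raises a `ValueError`.

```python
import numpy as np

a = np.arange(6)

# -1 lets NumPy infer the missing dimension (here: 3 columns).
two_rows = a.reshape(2, -1)
print(two_rows.shape)  # (2, 3)

# 4 * 2 = 8 elements do not match the 6 elements of a.
try:
    a.reshape(4, 2)
except ValueError as error:
    print(f"reshape failed: {error}")
```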


### 2.2.3 Indexing and slicing

You can index and slice Numpy arrays in the same ways you can slice Python lists:

In [88]:
data = np.array([1, 2, 3])

In [89]:
data[1]

2

In [90]:
data[0:2]

array([1, 2])

In [91]:
data[1:]

array([2, 3])

In [92]:
data[-2:]

array([2, 3])

If you want to select values from your array that fulfill certain conditions, it’s straightforward with Numpy.

For example, if you start with this array:

In [93]:
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

You can easily print all of the values in the array that are less than 5:

In [94]:
print(a[a < 5])

[1 2 3 4]


You can also select, for example, numbers that are equal to or greater than 5, and use that condition to index an array:

In [97]:
five_up = (a >= 5)
print(a[five_up])

[ 5  6  7  8  9 10 11 12]


Or you can combine conditions with the `&` (and) and `|` (or) operators, for example to select elements that satisfy two conditions:

In [98]:
c = a[(a > 2) & (a < 11)]
print (c)

[ 3  4  5  6  7  8  9 10]
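For completeness, here is a sketch of the `|` (or) operator and of `~`, which negates a condition. Each condition needs its own parentheses, because `&`, `|` and `~` bind more tightly than comparisons.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Elements that are smaller than 3 OR greater than 10.
extremes = a[(a < 3) | (a > 10)]
print(extremes)  # [ 1  2 11 12]

# Elements that are NOT even.
odd = a[~(a % 2 == 0)]
print(odd)  # [ 1  3  5  7  9 11]
```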


### 2.2.4 Basic array operations

Once you’ve created your arrays, you can start to work with them. Let’s say, for example, that you’ve created two arrays, one called “data” and one called “ones”.

You can add the arrays together with the plus sign:

In [100]:
data = np.array([1, 2])
ones = np.ones(2, dtype=int)

data + ones

array([2, 3])

You can, of course, do more than just addition:

In [101]:
data - ones
# data * data
# data / data

array([0, 1])
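Arithmetic also works between an array and a single number, or between arrays of different but compatible shapes; NumPy calls this broadcasting. A minimal sketch with illustrative values:

```python
import numpy as np

data = np.array([1.0, 2.0])
scaled = data * 1.6          # the scalar is applied to every element
print(scaled)

matrix = np.array([[1, 2], [3, 4]])
row = np.array([10, 20])
shifted = matrix + row       # the row is added to every row of the matrix
print(shifted)
```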

Basic operations are simple with Numpy. If you want to find the sum of the elements in an array, you’d use sum(). This works for 1D arrays, 2D arrays, and arrays in higher dimensions:

In [102]:
a = np.array([1, 2, 3, 4])

a.sum()

10

To add the rows or the columns in a 2D array, you would specify the axis.

If you start with this array:

In [103]:
b = np.array([[1, 1], [2, 2]])

You can sum over the axis of rows with:

In [104]:
b.sum(axis=0)

array([3, 3])

You can sum over the axis of columns with:

In [105]:
b.sum(axis=1)

array([2, 4])

### 2.2.5 Matrices

You can pass python lists of lists to create a 2-D array (or *matrix*) to represent them in Numpy.

Indexing and slicing operations are useful when you’re manipulating matrices:

In [106]:
data = np.array([[1, 2], [3, 4], [5, 6]])

data[0, 1]

2

In [107]:
data[1:3]

array([[3, 4],
       [5, 6]])

In [108]:
data[0:2, 0]

array([1, 3])

You can aggregate matrices the same way you aggregate vectors:

In [109]:
data.max()
# data.min()

6

In [110]:
data.sum()

21

You can aggregate all the values in a matrix and you can aggregate them across columns or rows using the axis parameter. To illustrate this point, let’s look at a slightly modified dataset:

In [112]:
data = np.array([[1, 2], [5, 3], [4, 6]])

data.max(axis=0)

array([5, 6])

In [113]:
data.max(axis=1)

array([2, 5, 6])
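The `axis` parameter works the same way for other aggregations. A short sketch on the same dataset:

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])

column_min = data.min(axis=0)    # smallest value in each column
row_mean = data.mean(axis=1)     # mean of each row
column_sum = data.sum(axis=0)    # sum of each column

print(column_min)  # [1 2]
print(row_mean)    # [1.5 4.  5. ]
print(column_sum)  # [10 11]
```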

### 2.2.6 Getting unique items and counts

You can find the unique elements in an array easily with np.unique.

For example, if you start with this array:

In [116]:
a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])

You can use `np.unique` to print the unique values in your array:

In [119]:
unique_values = np.unique(a)
print(unique_values)

[11 12 13 14 15 16 17 18 19 20]


To get the indices of unique values in a Numpy array (an array of first index positions of unique values in the array), just pass the `return_index` argument in `np.unique()` as well as your array:

In [121]:
unique_values, indices_list = np.unique(a, return_index=True)
print(indices_list)

[ 0  2  3  4  5  6  7 12 13 14]


This also works with 2D arrays; [here](https://numpy.org/doc/stable/user/absolute_beginners.html#how-to-get-unique-items-and-counts) is how.
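To get the counts mentioned in the heading, pass `return_counts=True` to `np.unique()`. The variable name `occurrence_count` is chosen for illustration.

```python
import numpy as np

a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])

# The second return value holds how often each unique value occurs.
unique_values, occurrence_count = np.unique(a, return_counts=True)

print(unique_values)     # [11 12 13 14 15 16 17 18 19 20]
print(occurrence_count)  # [3 2 2 2 1 1 1 1 1 1]
```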

### 2.2.7 Reversing an array

Numpy’s `np.flip()` function allows you to flip, or reverse, the contents of an array along an axis. When using `np.flip()`, specify the array you would like to reverse and the axis. If you don’t specify the axis, Numpy will reverse the contents along all of the axes of your input array.

#### Reversing a 1D array

If you begin with a 1D array like `arr`, you can reverse it with `flip` like this:

In [123]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

reversed_arr = np.flip(arr)

reversed_arr

array([8, 7, 6, 5, 4, 3, 2, 1])

#### Reversing a 2D array

A 2D array works much the same way. If you start with a 2D array like `arr_2d`, you can reverse the content in all of its rows and all of its columns like this:

In [124]:
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

reversed_arr = np.flip(arr_2d)

reversed_arr

array([[12, 11, 10,  9],
       [ 8,  7,  6,  5],
       [ 4,  3,  2,  1]])

You can easily reverse only the rows or columns with:

In [127]:
reversed_arr_rows = np.flip(arr_2d, axis=0)

reversed_arr_columns = np.flip(arr_2d, axis=1)
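To see what the two calls produce, the following self-contained sketch prints both results: `axis=0` reverses the order of the rows, while `axis=1` reverses the order of the columns.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

reversed_arr_rows = np.flip(arr_2d, axis=0)      # last row comes first
reversed_arr_columns = np.flip(arr_2d, axis=1)   # last column comes first

print(reversed_arr_rows)
print(reversed_arr_columns)
```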

# 3. Pandas [<a href='#destination3'>4</a>] <a id='destination3_'></a>

<img src='images/pandas.png' style='height: 150px; float: right; margin-left: 10px' >

**Pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas is well suited for many different kinds of data:

   - Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

   - Ordered and unordered (not necessarily fixed-frequency) time series data.

   - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

   - Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure
   
The two primary data structures of pandas, **Series** (1-dimensional) and **DataFrame** (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, **DataFrame** provides everything that R’s `data.frame` provides and much more. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

You can install pandas with `conda` or `pip` in your cmd/terminal:

In [None]:
conda install pandas
# pip install pandas

To load the pandas package and start working with it, import the package. The community agreed alias for pandas is `pd`, so loading pandas as `pd` is assumed standard practice for all of the pandas documentation.

In [2]:
import pandas as pd

## 3.1 Pandas data table representation

<img src='images/df.png' style='height: 350px; float: right; margin-right: 300px' >


Let's say you want to store some passenger data of Titanic. For a number of passengers, you know the name (characters), age (integers) and sex (male/female) data. To manually store the data in a table, you should create a *DataFrame*. When using a python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame:

In [54]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    })

In [55]:
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.

- The table has 3 columns, each of them with a column label. The column labels are respectively `Name`, `Age` and `Sex`.

- The column `Name` consists of textual data, with each value a string; the column `Age` contains numbers; and the column `Sex` is textual data.

In spreadsheet software, the table representation of our data would look very similar:

<img src='images/spreadsheet.png' style='height: 350px; float: right; margin-right: 200px' >

You can get the shape of any dataframe like this:

In [56]:
df.shape

(3, 3)
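Besides the shape, the column labels and the data type of each column are often the first things to check. A minimal, self-contained sketch that rebuilds the same small table:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    })

column_labels = df.columns.tolist()
print(column_labels)  # ['Name', 'Age', 'Sex']

# One data type per column; Age is stored as integers.
print(df.dtypes)
```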

## 3.2 Pandas Series

Each column in a DataFrame is a **Series**.

<img src='images/series.png' style='height: 300px; float: right; margin-right: 400px' >

Let's say we are just interested in working with the data in the column `Age`. We can access the Age *Series* like this:

In [138]:
df["Age"]

0    22
1    35
2    58
Name: Age, dtype: int64

We can create a Series from scratch as well:

In [139]:
ages = pd.Series([22, 35, 58], name="Age")

A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
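A Series supports aggregations directly, and its row labels can be set with the `index` argument. In this sketch, the labels "Braund", "Allen" and "Bonnell" are shortened names chosen for illustration.

```python
import pandas as pd

ages = pd.Series([22, 35, 58], name="Age")

oldest = ages.max()
average = ages.mean()
print(oldest)   # 58
print(average)

# Row labels do not have to be integers.
labelled_ages = pd.Series([22, 35, 58], index=["Braund", "Allen", "Bonnell"], name="Age")
print(labelled_ages["Allen"])  # 35
```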

## 3.3 Reading and writing tabular data with Pandas

In this section, we start to work on some real data from Twitter. It is taken from the [TweetsCOV19 dataset](https://data.gesis.org/tweetscov19/), a semantically annotated corpus of tweets about the COVID-19 pandemic.

Download the file https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz, and after extraction, store it in the ./data directory.

The dataset contains tab-separated values (TSV). For TSV or CSV (comma-separated values) files, we use the `pandas.read_csv()` function to read the data:

In [2]:
tweets = pd.read_csv('./data/data', sep='\t', header=None, quoting= 3)

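The first argument specifies the path of the file to read, and the second one (`sep='\t'`) specifies that the values are separated by tabs (TSV). We will come back to the third one, `header=None`, in a few cells. The last one, `quoting=3`, is the value of `csv.QUOTE_NONE` and tells pandas to treat quotation marks as ordinary characters.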

You can check all the arguments that pd.read_csv can take [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
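To try these arguments without downloading the full dataset, the following sketch reads a tiny, made-up TSV string from memory. The two rows in `raw` are invented for illustration and are not taken from TweetsCOV19. Because of `csv.QUOTE_NONE`, the unbalanced quotation mark in the first row is kept as an ordinary character.

```python
import csv
import io

import pandas as pd

# Two invented rows with three tab-separated fields each (illustration only).
raw = '1\tuser_a\tsaid "hello\n2\tuser_b\tnull;\n'

sample = pd.read_csv(
    io.StringIO(raw),        # behaves like an open text file
    sep='\t',                # fields are separated by tabs
    header=None,             # the data has no header row
    quoting=csv.QUOTE_NONE,  # same as quoting=3: quotes are plain characters
)

print(sample)
print(sample.columns.tolist())  # [0, 1, 2]
```

With `header=None`, pandas numbers the columns starting from 0 instead of using the first row as labels.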

You can investigate the `tweets` dataframe and see how it looks:

In [8]:
tweets

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
...,...,...,...,...,...,...,...,...,...,...,...,...
1922400,1267207472424660992,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,Sun May 31 21:32:59 +0000 2020,15,45,0,0,spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi...,2 -1,null;,null;,null;
1922401,1267207883487354881,0e4323d01d164b9eb6e33f35564c7e25,Sun May 31 21:34:37 +0000 2020,43,931,0,0,china:China:-2.113921624336916;death penalty:C...,1 -2,null;,null;,null;
1922402,1267209309559173122,00fc2c96e4012e27a6eee351723ab461,Sun May 31 21:40:17 +0000 2020,256,451,0,0,null;,2 -1,null;,null;,null;
1922403,1267212987938545667,0f99a3b8b0d490f062215575d074518b,Sun May 31 21:54:54 +0000 2020,1467,1505,0,0,omg:OMG_%28Usher_song%29:-2.580063760606172;,2 -1,lsddrq,null;,null;


As you can see, it's a pretty large dataset with 1922405 rows and 12 columns. It still doesn't have the right column labels. We can set those labels according to the metadata of the dataset:

In [3]:
tweets.columns = ['tweet_id', 'username', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'entities', 'sentiment', 'mentions', 'hashtags', 'urls']

In [11]:
tweets

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,sentiment,mentions,hashtags,urls
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
...,...,...,...,...,...,...,...,...,...,...,...,...
1922400,1267207472424660992,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,Sun May 31 21:32:59 +0000 2020,15,45,0,0,spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi...,2 -1,null;,null;,null;
1922401,1267207883487354881,0e4323d01d164b9eb6e33f35564c7e25,Sun May 31 21:34:37 +0000 2020,43,931,0,0,china:China:-2.113921624336916;death penalty:C...,1 -2,null;,null;,null;
1922402,1267209309559173122,00fc2c96e4012e27a6eee351723ab461,Sun May 31 21:40:17 +0000 2020,256,451,0,0,null;,2 -1,null;,null;,null;
1922403,1267212987938545667,0f99a3b8b0d490f062215575d074518b,Sun May 31 21:54:54 +0000 2020,1467,1505,0,0,omg:OMG_%28Usher_song%29:-2.580063760606172;,2 -1,lsddrq,null;,null;


As you can see, all 12 columns of the dataframe now have labels. The leftmost column, which contains numbers in bold, is not actually a data column; it holds the row index.

Here is some more information on the data columns:

- Tweet Id: Long.
- Username: String. Encrypted for privacy issues.
- Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
- #Followers: Integer.
- #Friends: Integer.
- #Retweets: Integer.
- #Favorites: Integer.
- Entities: String. For each entity, we aggregated the original text, the annotated entity and the produced score from FEL library. Each entity is separated from another entity by char ";". Also, the three parts of each entity are separated by char ":" in order to store "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
- Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers by whitespace char " ". Positive sentiment was stored first and then negative sentiment (e.g. "2 -1").
- Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".
- Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".
- URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".


You can access the first n rows of the data using the `head()` method:

In [16]:
head = tweets.head(5)
head

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,sentiment,mentions,hashtags,urls
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;


To write tabular data to a CSV file, you can use the `to_csv()` method. For the `head` dataframe that we created above, it looks like this:

In [18]:
head.to_csv('./head.csv')

The argument specifies the path and the name of the file to which the data is written. Note that, by default, `to_csv()` also writes the row indices as an extra first column. You can check the other arguments that `to_csv()` accepts [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
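
As a minimal sketch of a round trip, the following cell writes a small, made-up dataframe to a temporary file without the row indices (`index=False`) and reads it back. The `sample` dataframe and the file name are assumptions for illustration only and are not part of the tweets data.

```python
import os
import tempfile

import pandas as pd

# A small made-up dataframe (not part of the tweets data)
sample = pd.DataFrame({"username": ["user_a", "user_b"], "followers": [10, 20]})

# Write to a temporary folder so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# index=False leaves out the row indices, so no extra unnamed column is written
sample.to_csv(path, index=False)

# Reading the file back gives the same columns and values
restored = pd.read_csv(path)
print(restored)
```

Leaving out the indices is a matter of choice: if the indices carry meaning (e.g. tweet positions), you may prefer to keep them.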

## 3.4 Selecting a subset of data

<img src='images/df_selection.png' style='height: 200px; float: right; margin-right: 100px' >

You can extract subsets of a dataframe and work with the new data. We have already mentioned in the previous section how to extract one column (a series) of a dataframe. Multiple columns can be extracted like this:

In [11]:
username_followers = tweets[["username", "followers"]]

username_followers

# username_followers.shape

Unnamed: 0,username,followers
0,fa5fd446e778da0acba3504aeab23da5,29697
1,547501e9cc84b8148ae1b8bde04157a4,799
2,840ac60dab55f6b212dc02dcbe5dfbd6,586
3,37c68a001198b5efd4a21e2b68a0c9bc,237
4,8c3620bdfb9d2a1acfdf2412c9b34e06,423
...,...,...
1922400,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,15
1922401,0e4323d01d164b9eb6e33f35564c7e25,43
1922402,00fc2c96e4012e27a6eee351723ab461,256
1922403,0f99a3b8b0d490f062215575d074518b,1467


### 3.4.1 Indexing

You can access certain parts of the dataframe based on the indices of rows or columns. To get a range of rows by position, you can slice the dataframe with `[]`. As with Python lists, the end of the slice is excluded, so rows 10 to 14 of the tweets dataframe are selected like this:

In [16]:
tweets[10:15]

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,sentiment,mentions,hashtags,urls
10,1255985760790069251,c8f0b58eb5105e2318e15ff17b9e4250,Thu Apr 30 22:21:55 +0000 2020,722,572,4,33,fred guttenberg:Fred_Guttenberg:-1.35898888441...,1 -1,fred_guttenberg GovWhitmer,null;,null;
11,1255986390803890179,d2c476ec35a8274be8a7e16ccd3cee49,Thu Apr 30 22:24:25 +0000 2020,32756,3035,15,7,kaduna:Kaduna:-1.6483966010617883;kano:Kano:-2...,1 -1,MubarakBala,FreeMubarakBala,null;
12,1255987424003420160,33ceaa2e6b5fd8b8f565fbad72caa5d6,Thu Apr 30 22:28:31 +0000 2020,554,313,0,0,null;,1 -1,null;,null;,null;
13,1255990446582575109,a97b14453e6bff7957c0e79058ddcb32,Thu Apr 30 22:40:32 +0000 2020,6,2,1,2,mayday:Mayday_%28Canadian_TV_series%29:-2.4274...,1 -1,null;,null;,null;
14,1255993051161546752,ed54b52688d4a550d43e68d843f2cb2d,Thu Apr 30 22:50:53 +0000 2020,183,427,1,0,leukemia:Leukemia:-1.5352150015828259;all blac...,3 -3,null;,adoptables blackcats beautifulcats rescue cats...,null;


<img src='images/row_col.png' style='height: 200px; float: right; margin-right: 100px' >

You can also specify rows and columns at the same time when indexing your dataframe, as in the picture above. For this, use `iloc`, which selects by integer position. For example, to get rows 25 to 29 and the columns at positions 4 to 6 (positions start at 0, and the end of each slice is excluded), you can do it like this:

In [42]:
df = tweets.iloc[25:30, 4:7]
df

Unnamed: 0,friends,retweets,favorites
25,333,0,0
26,493,0,0
27,1830,0,0
28,110,3,5
29,859,29,47
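
Besides `iloc`, pandas also offers `loc`, which selects by label instead of position. The following sketch, on a made-up `toy` dataframe (an assumption for illustration), selects the same cells in both ways. Note that label slices with `loc` include their end point, while position slices with `iloc` exclude it.

```python
import pandas as pd

# Made-up data with row labels that do not start at 0
toy = pd.DataFrame(
    {
        "followers": [5, 50, 500, 5000],
        "friends": [1, 2, 3, 4],
        "retweets": [0, 1, 0, 2],
    },
    index=[10, 11, 12, 13],
)

# Position-based: rows at positions 1 and 2, columns at positions 0 and 1
by_position = toy.iloc[1:3, 0:2]

# Label-based: rows labelled 11 to 12, columns "followers" to "friends" (both ends included)
by_label = toy.loc[11:12, "followers":"friends"]

print(by_position)
print(by_label)
```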


### 3.4.2 Filtering specific rows

We can also filter our data based on conditions on the rows. Let's say, for example, that in the `username_followers` dataframe we want to get all the rows with more than 2000 followers. The comparison inside the brackets produces a series of `True`/`False` values, and only the rows marked `True` are kept:

In [13]:
username_followers[username_followers["followers"] > 2000]

Unnamed: 0,username,followers
0,fa5fd446e778da0acba3504aeab23da5,29697
6,916dec763c84916c929bb257ff96187d,70185
11,d2c476ec35a8274be8a7e16ccd3cee49,32756
15,8649803dcbe651e7e7b7ff3ba2c324f0,6598
19,85bc7878f69dcc52b76f707a3cb957aa,4542
...,...,...
1922381,f67bd1f0c41cb218ad9e143fc8dda6a8,2755
1922383,b3838a18c0363d30f8324872c318b975,4153
1922385,991b55d55fe82c5e478e96c2b0a733e4,12000
1922386,8ee3368da012755f5353f5669d83acb1,37546
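
Conditions can also be combined. Here is a small sketch on made-up data (the `toy` dataframe is an assumption for illustration): `&` means "and", `|` means "or", and each condition needs its own parentheses.

```python
import pandas as pd

# Made-up data in the style of the tweets dataframe
toy = pd.DataFrame(
    {
        "username": ["u1", "u2", "u3", "u4"],
        "followers": [100, 2500, 30000, 1800],
        "retweets": [0, 3, 0, 7],
    }
)

# Rows that satisfy both conditions
both = toy[(toy["followers"] > 2000) & (toy["retweets"] > 0)]

# Rows that satisfy at least one of the conditions
either = toy[(toy["followers"] > 2000) | (toy["retweets"] > 0)]

print(both)
print(either)
```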


## 3.5 Adding/removing columns

You can also remove a column from a dataframe, or add new columns to it. Let's say, for example, that we want to remove the `sentiment` column of the tweets dataframe and replace it with two columns, `sentiment_pos` and `sentiment_neg`, where `sentiment_pos` is the first (positive) number in the `sentiment` value of each row and `sentiment_neg` is the second (negative) number.

To do that, we first collect the positive and negative values of the `sentiment` column in two lists, `pos` and `neg`, and then create two new columns in the dataframe from those lists. Finally, we delete the `sentiment` column:

In [4]:
# Putting the pos/neg values in the lists:

pos = []
neg = []

for i in tweets['sentiment']:
    # Each value looks like "2 -1": positive score first, then negative score
    parts = i.split()
    pos.append(parts[0])
    neg.append(parts[1])

# Making the new columns from the lists (the values are still strings):

tweets['sentiment_pos'] = pos
tweets['sentiment_neg'] = neg

# Deleting the sentiment column:

del tweets['sentiment']

In [5]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,null;,Opinion Next2blowafrica thoughts,null;,1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,null;,null;,null;,1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,null;,null;,null;,1,-4
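
The loop above can also be written without an explicit loop. The following is a sketch of one possible alternative on a made-up `toy` dataframe (an assumption for illustration): `str.split()` with `expand=True` returns one column per part, and `astype(int)` converts the parts to integers, which the loop version does not do.

```python
import pandas as pd

# Made-up sentiment strings in the same "pos neg" format
toy = pd.DataFrame({"sentiment": ["1 -1", "2 -1", "1 -4"]})

# Split every string at the space; expand=True returns the parts as columns 0 and 1
parts = toy["sentiment"].str.split(" ", expand=True)

# Store the parts as integer columns
toy["sentiment_pos"] = parts[0].astype(int)
toy["sentiment_neg"] = parts[1].astype(int)

# Remove the original column
toy = toy.drop(columns="sentiment")

print(toy)
```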


## 3.6 Merging, joining and concatenating dataframes

You can merge, join or concatenate dataframes. We will go through all these operations step by step. 


### 3.6.1 Concatenation

Let's say you have three dataframes, each with 4 rows like this:

In [56]:
df1 = tweets.iloc[:4, :7]
df1

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0


In [57]:
df2 = tweets.iloc[500:504, :7]
df2

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
500,1256309827594121217,11312d0c0c9dc70882c068f4cdacf60f,Fri May 01 19:49:38 +0000 2020,58,241,0,0
501,1256309852063911936,199e1000aac84790246d76bb4e63d075,Fri May 01 19:49:44 +0000 2020,366,159,9,36
502,1256310427388018688,f723076b285c935d1b3b551f1c4de44b,Fri May 01 19:52:01 +0000 2020,1655,320,0,0
503,1256311962499026944,5abd63afc16692544ea01818f9b054ad,Fri May 01 19:58:07 +0000 2020,2922,928,0,0


In [58]:
df3 = tweets.iloc[1000:1004, :7]
df3

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
1000,1256699767846866944,30432785cddb22a39ddd455a22db83b4,Sat May 02 21:39:07 +0000 2020,1124,1165,0,0
1001,1256700006918168577,cfe0272a5e1979777c47e9c8b956ac75,Sat May 02 21:40:04 +0000 2020,24,543,0,0
1002,1256700166309961728,9481d0aa694237dacd295df55045c332,Sat May 02 21:40:42 +0000 2020,900,724,0,0
1003,1256702775154466818,85f3dc93dc247e1fe2419cd08e20c3aa,Sat May 02 21:51:04 +0000 2020,31280,26083,0,0


You can **concatenate** them with the `pd.concat()` function like this:

In [59]:
df = pd.concat([df1, df2, df3])
df

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0
500,1256309827594121217,11312d0c0c9dc70882c068f4cdacf60f,Fri May 01 19:49:38 +0000 2020,58,241,0,0
501,1256309852063911936,199e1000aac84790246d76bb4e63d075,Fri May 01 19:49:44 +0000 2020,366,159,9,36
502,1256310427388018688,f723076b285c935d1b3b551f1c4de44b,Fri May 01 19:52:01 +0000 2020,1655,320,0,0
503,1256311962499026944,5abd63afc16692544ea01818f9b054ad,Fri May 01 19:58:07 +0000 2020,2922,928,0,0
1000,1256699767846866944,30432785cddb22a39ddd455a22db83b4,Sat May 02 21:39:07 +0000 2020,1124,1165,0,0
1001,1256700006918168577,cfe0272a5e1979777c47e9c8b956ac75,Sat May 02 21:40:04 +0000 2020,24,543,0,0


As you can see, the rows keep the indices they had in the initial dataframes. If you want the final dataframe to have new, consecutive indices, you can pass `ignore_index = True` as an argument to the `concat()` function:

In [60]:
df = pd.concat([df1, df2, df3], ignore_index = True)
df

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0
4,1256309827594121217,11312d0c0c9dc70882c068f4cdacf60f,Fri May 01 19:49:38 +0000 2020,58,241,0,0
5,1256309852063911936,199e1000aac84790246d76bb4e63d075,Fri May 01 19:49:44 +0000 2020,366,159,9,36
6,1256310427388018688,f723076b285c935d1b3b551f1c4de44b,Fri May 01 19:52:01 +0000 2020,1655,320,0,0
7,1256311962499026944,5abd63afc16692544ea01818f9b054ad,Fri May 01 19:58:07 +0000 2020,2922,928,0,0
8,1256699767846866944,30432785cddb22a39ddd455a22db83b4,Sat May 02 21:39:07 +0000 2020,1124,1165,0,0
9,1256700006918168577,cfe0272a5e1979777c47e9c8b956ac75,Sat May 02 21:40:04 +0000 2020,24,543,0,0


### 3.6.2 Resetting indices

You can reset the indices of a dataframe whenever you want. Let's say we want to reset the indices of the second dataframe (`df2`) for example. You can do it like this:

In [61]:
df2 = df2.reset_index()
df2

Unnamed: 0,index,tweet_id,username,timestamp,followers,friends,retweets,favorites
0,500,1256309827594121217,11312d0c0c9dc70882c068f4cdacf60f,Fri May 01 19:49:38 +0000 2020,58,241,0,0
1,501,1256309852063911936,199e1000aac84790246d76bb4e63d075,Fri May 01 19:49:44 +0000 2020,366,159,9,36
2,502,1256310427388018688,f723076b285c935d1b3b551f1c4de44b,Fri May 01 19:52:01 +0000 2020,1655,320,0,0
3,503,1256311962499026944,5abd63afc16692544ea01818f9b054ad,Fri May 01 19:58:07 +0000 2020,2922,928,0,0


As you can see, the previous indices are now stored in a new column named `index`. You can simply remove that column with `del`:

In [63]:
del df2['index']
df2

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites
0,1256309827594121217,11312d0c0c9dc70882c068f4cdacf60f,Fri May 01 19:49:38 +0000 2020,58,241,0,0
1,1256309852063911936,199e1000aac84790246d76bb4e63d075,Fri May 01 19:49:44 +0000 2020,366,159,9,36
2,1256310427388018688,f723076b285c935d1b3b551f1c4de44b,Fri May 01 19:52:01 +0000 2020,1655,320,0,0
3,1256311962499026944,5abd63afc16692544ea01818f9b054ad,Fri May 01 19:58:07 +0000 2020,2922,928,0,0
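
The two steps can also be done in one: `reset_index()` accepts a `drop` argument that discards the old indices instead of turning them into a column. A minimal sketch on a made-up dataframe (`part` is an assumption for illustration):

```python
import pandas as pd

# Made-up slice of a larger dataframe, with indices 500 and 501
part = pd.DataFrame({"followers": [58, 366]}, index=[500, 501])

# drop=True throws the old indices away, so no "index" column is created
renumbered = part.reset_index(drop=True)

print(renumbered)
```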


### 3.6.3 Joining

We can use `DataFrame.join()` to combine the columns of our main dataframe with the columns of another dataframe.

Let's say we have two dataframes `main_df` and `other_df` like this:

In [2]:
main_df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

main_df

Unnamed: 0,key,A
0,K0,A0
1,K1,A1
2,K2,A2
3,K3,A3
4,K4,A4
5,K5,A5


In [3]:
other_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

other_df

Unnamed: 0,key,B
0,K0,B0
1,K1,B1
2,K2,B2


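By default, `join()` matches rows by their indices. Because both dataframes have a `key` column, we need to provide suffixes to tell the two apart; both key columns are then present in the final dataframe, like this: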

In [5]:
main_df.join(other_df, lsuffix='_caller', rsuffix='_other')

Unnamed: 0,key_caller,A,key_other,B
0,K0,A0,K0,B0
1,K1,A1,K1,B1
2,K2,A2,K2,B2
3,K3,A3,,
4,K4,A4,,
5,K5,A5,,


Alternatively, we can use the `key` column of both dataframes as the index, so that the rows are matched by key, like this:

*Note: Keys of `main_df` that are missing from `other_df` get `NaN` values in the final dataframe.*

In [6]:
main_df.set_index('key').join(other_df.set_index('key'))

Unnamed: 0_level_0,A,B
key,Unnamed: 1_level_1,Unnamed: 2_level_1
K0,A0,B0
K1,A1,B1
K2,A2,B2
K3,A3,
K4,A4,
K5,A5,
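
If you do not want rows with `NaN` values, `join()` has a `how` argument that controls which keys are kept. The sketch below uses shortened, made-up versions of the two dataframes (the names `main` and `other` are assumptions for illustration) and keeps only the keys that are present in both.

```python
import pandas as pd

# Shortened, made-up versions of main_df and other_df
main = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"]})
other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})

# how="inner" keeps only the keys that occur in both dataframes
inner = main.set_index("key").join(other.set_index("key"), how="inner")

print(inner)
```

The default is `how="left"`, which keeps all keys of the calling dataframe, as in the examples above.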


### 3.6.4 Merging

`DataFrame.merge()` merges dataframes or named series with a database-style join.

The merging is done on columns or indices. If merging columns on columns, the dataframe indices will be ignored. Otherwise if merging indices on indices or indices on a column or columns, the index will be passed on. 

Let's say we have two dataframes, `df1` and `df2` like this:

In [9]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})

df1

Unnamed: 0,lkey,value
0,foo,1
1,bar,2
2,baz,3
3,foo,5


In [10]:
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

df2

Unnamed: 0,rkey,value
0,foo,5
1,bar,6
2,baz,7
3,foo,8


If we merge `df1` and `df2` on the `lkey` and `rkey` columns, the overlapping `value` columns get the default suffixes `_x` and `_y` appended. Since `foo` occurs twice in both dataframes, every combination of the matching rows appears in the result:

In [11]:
df1.merge(df2, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7


If we merge dataframes `df1` and `df2` with specified left and right suffixes appended to any overlapping columns, it will be like this:

In [12]:
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))

Unnamed: 0,lkey,value_left,rkey,value_right
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7
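
By default, `merge()` keeps only the keys that occur in both dataframes (an inner join). The following sketch, on made-up dataframes `left` and `right` (assumptions for illustration), compares this with an outer join. The `indicator=True` argument adds a `_merge` column that tells where each row comes from.

```python
import pandas as pd

# Made-up dataframes that share the keys "foo" and "bar" only
left = pd.DataFrame({"key": ["foo", "bar", "qux"], "value_left": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "baz"], "value_right": [5, 6, 7]})

# Default (inner) merge: only keys present in both dataframes
inner = left.merge(right, on="key")

# Outer merge: all keys from both sides; missing values become NaN
outer = left.merge(right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```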


## 3.7 Applying functions to dataframe columns

You can define functions of your own and apply them to every item of a column in a dataframe. This is done using the `apply()` method.

Let's say we want to split the hashtags in the `hashtags` column and keep them in lists instead of strings. Right now, they are stored as strings in which the hashtags are separated from each other by a space character. We can define a function `f1` that splits such strings and converts them to lists, and then apply that function to all items of the column:

In [6]:
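def f1(cell):
    # Missing values are stored either as the string 'null;' or as a float (NaN)
    if cell == 'null;' or isinstance(cell, float):
        cell = ['']
    else:
        # Hashtags are separated by whitespace
        cell = cell.split()
    return cell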

In [7]:
tweets['hashtags'] = tweets['hashtags'].apply(f1)

In [8]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,null;,"[Opinion, Next2blowafrica, thoughts]",null;,1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,null;,[],null;,1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,null;,[],https://www.bbc.com/news/uk-england-beds-bucks...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,null;,[],https://lockdownsceptics.org/2020/04/30/latest...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,null;,[],null;,1,-4


In [27]:
print (tweets['hashtags'][0])
print (tweets['hashtags'][1])

['Opinion', 'Next2blowafrica', 'thoughts']
['']


The function splits the strings into lists. If the hashtag item in a row is not available or has a float value (there are some in our data), it is replaced with a list that contains a single empty string (`['']`), which the table above displays as `[]`.

We do the same for the `mentions`, `entities` and `urls` columns, because we need their values as lists for what we are going to do with the data later in this notebook:

In [9]:
tweets['mentions'] = tweets['mentions'].apply(f1)

In [None]:
# New function for applying to entities and urls:

def f2(cell, column):
    if cell == 'null;' or isinstance(cell, float):
        cell = ['']
    else:
        if column == 'entities':
            splitted = cell.split(';')
        elif column == 'urls':
            splitted = cell.split(':-:')
        # The string ends with a separator, so the last item after splitting is a leftover; we drop it
        del splitted[-1]
        cell = splitted
        
    return cell


# applying f2 to entities and urls columns:

tweets['entities'] = tweets['entities'].apply(f2, column = 'entities')
tweets['urls'] = tweets['urls'].apply(f2, column = 'urls')

In [12]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,[],[],"[Opinion, Next2blowafrica, thoughts]",[],1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,[],[],[],[],1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,[],[],[],[https://www.bbc.com/news/uk-england-beds-buck...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,[],[],[],[https://lockdownsceptics.org/2020/04/30/lates...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,"[i hate u:I_Hate_U:-1.8786140035817729, quaran...",[],[],[],1,-4


The function applied to `entities` and `urls` is a bit different:
- `entities` values are separated by the `;` character, so we split them on `;` instead of the space character.
- `urls` values are separated by the `:-:` characters, so we split them on that.
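
To see how `apply()` passes an extra argument to a function, here is a small self-contained sketch. The function `split_cell` and the `toy` series are hypothetical and only for illustration; unlike `f2`, the function takes the separator itself as an argument and filters out empty parts instead of deleting the last item.

```python
import pandas as pd


def split_cell(cell, sep):
    """Split a string at sep; missing values become a list with one empty string."""
    if not isinstance(cell, str) or cell == "null;":
        return [""]
    # Keep only the non-empty parts (the trailing separator leaves an empty one)
    return [part for part in cell.split(sep) if part.strip()]


# Made-up values: one entities string, one 'null;' marker and one NaN
toy = pd.Series(["a:A:-1.0;b:B:-2.0;", "null;", float("nan")])

# Keyword arguments given to apply() are handed on to the function
result = toy.apply(split_cell, sep=";")

print(result.tolist())
```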

## 3.8 groupby(), squeeze() and idxmax()

Here we introduce 3 more useful functions for working with dataframes.

### 3.8.1 DataFrame.groupby()

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

As the first example, take this dataframe:

In [71]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

df

Unnamed: 0,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0


If we want to get the mean of each animal's max speed, we can do it like this:

In [72]:
df.groupby(['Animal']).mean()

Unnamed: 0_level_0,Max Speed
Animal,Unnamed: 1_level_1
Falcon,375.0
Parrot,25.0


You will get to work with more examples of this function later in this notebook.
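
A group can also be summarized with several functions at once. The following sketch uses `agg()` with named aggregations on a made-up `toy` dataframe (the data and the new column names are assumptions for illustration): each keyword names a result column and gives the source column together with the function to apply.

```python
import pandas as pd

# Made-up tweets of two users
toy = pd.DataFrame(
    {
        "username": ["u1", "u1", "u2", "u2", "u2"],
        "followers": [10, 12, 300, 310, 305],
        "retweets": [0, 1, 2, 0, 4],
    }
)

# One row per user: result_column=(source_column, function)
summary = toy.groupby("username").agg(
    followers_max=("followers", "max"),
    retweets_total=("retweets", "sum"),
    n_tweets=("retweets", "size"),
)

print(summary)
```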

### 3.8.2 DataFrame.squeeze()

`squeeze()` removes axes of length 1 from an object:

- Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

- This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Example:

In [85]:
primes = pd.Series([2, 3, 5, 7])
even_primes = primes[primes % 2 == 0]
even_primes

0    2
dtype: int64

In [86]:
even_primes.squeeze()

2

Squeezing objects with more than one value in every axis does nothing:

In [87]:
odd_primes = primes[primes % 2 == 1]
odd_primes

1    3
2    5
3    7
dtype: int64

In [88]:
odd_primes.squeeze()

1    3
2    5
3    7
dtype: int64

Squeezing when working with DataFrames:

In [89]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df

Unnamed: 0,a,b
0,1,2
1,3,4


Selecting a single column with a list of labels (`df[['a']]`) produces a dataframe that has only one column:

In [90]:
df_a = df[['a']]
df_a

Unnamed: 0,a
0,1
1,3


So the columns can be squeezed down, resulting in a Series:

In [91]:
df_a.squeeze('columns')

0    1
1    3
Name: a, dtype: int64

### 3.8.3 DataFrame.idxmax()

It returns the index of the first occurrence of the maximum over the requested axis. NA/null values are excluded.

As an example, take this dataframe:

In [93]:
df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
                   'co2_emissions': [37.2, 19.66, 1712]},
                   index=['Pork', 'Wheat Products', 'Beef'])

df

Unnamed: 0,consumption,co2_emissions
Pork,10.51,37.2
Wheat Products,103.11,19.66
Beef,55.48,1712.0


By default, `df.idxmax()` returns the index for the maximum value in each column:

In [94]:
df.idxmax()

consumption      Wheat Products
co2_emissions              Beef
dtype: object

To return the index for the maximum value in each row, you can use `axis="columns"`:

In [95]:
df.idxmax(axis="columns")

Pork              co2_emissions
Wheat Products      consumption
Beef              co2_emissions
dtype: object

### 3.8.4 Example: Making the `users` dataframe

Let's say we want to make a dataframe out of `tweets` that gives us some useful information about the users in our main dataframe.

We want this dataframe to contain all usernames in one column, the maximum number of followers of each username (in the whole data) in a second column, and the maximum number of friends of each username (in the whole data) in a third column. Finally, we want all these usernames sorted by the maximum number of followers, in descending order.

Based on what we have learned so far, it could be done like this:

*Note: This may take around 90 seconds to run.*

In [13]:
# Getting the maximum number of followers for each user:

followers_max = tweets.loc[tweets.groupby('username')['followers'].idxmax()]
followers_max = followers_max.reset_index()

# Getting the maximum number of friends for each user:

friends_max = tweets.loc[tweets.groupby('username')['friends'].idxmax()]
friends_max = friends_max.reset_index()

# Putting them all in the new dataframe:

users = pd.DataFrame()
users['username'] = friends_max['username']
users['followers_max'] = followers_max['followers']
users['friends_max'] = friends_max['friends']

# Sorting the rows in descending order of followers_max:

users = users.sort_values('followers_max', ascending = False).reset_index()

# Deleting the index column, since we have reset the indices:

del users['index']

users

Unnamed: 0,username,followers_max,friends_max
0,c1d4d177b4028f2b6ea90a3617c32fb6,117926717,606040
1,0b64e075d55e5221457d3e22ba3dcc14,111636059,299530
2,7cd534d396546a50ddd2dea9ee7f9145,108555597,224
3,a075253a703c963c96f819be90e82a67,81495144,123005
4,75224fc65ae453fe9ec3ca855cd8619b,80751709,46
...,...,...,...
1122828,5713bd926a702800a5b24131d42fdc9b,0,38
1122829,3cab3ac9ba90bdce7bef2ffb17820fa9,0,28
1122830,b19b8939c414154978b1b996b27fc0c6,0,103
1122831,3caad3ca8cc57b5eb5321530cc8fd4b8,0,25


As you can see, the `users` dataframe has hundreds of thousands fewer rows than `tweets`. This is because in the `tweets` dataframe every username can have many tweets, whereas in `users` each username occurs only once. So, the number of rows in the `users` dataframe is in fact the number of users who have tweeted in the `tweets` data.
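
If only the maximum values are needed, and not the rows in which they occur, a shorter route is to call `max()` on the groups directly. This is a sketch of an alternative on made-up data (the `toy` dataframe is an assumption for illustration), not the method used above:

```python
import pandas as pd

# Made-up tweets of two users
toy = pd.DataFrame(
    {
        "username": ["u1", "u2", "u1", "u2"],
        "followers": [10, 300, 12, 290],
        "friends": [5, 7, 4, 9],
    }
)

toy_users = (
    toy.groupby("username")[["followers", "friends"]]
    .max()  # maximum of each column per user
    .rename(columns={"followers": "followers_max", "friends": "friends_max"})
    .sort_values("followers_max", ascending=False)  # largest accounts first
    .reset_index()  # turn the usernames back into a column
)

print(toy_users)
```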

Since the usernames are encrypted, we can only guess who is behind them. Judging by the follower count, the user with the highest number of followers is probably Barack Obama, according to [this link](https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts#:~:text=This%20list%20contains%20the%20top,over%20100%20million%20followers%20each.)!



## 3.9 Some other examples of working with `tweets` dataframe

In this section, we are going to play with the data a little more, and will introduce some other useful functions along the way.

### 3.9.1 tweets_df

We want to make a dataframe called **tweets_df** that contains the columns `identifier`, `timestamp`, `followers`, `friends`, `retweets`, `favorites`, `sentiment_pos` and `sentiment_neg`, together with a new column `user_id`. The `user_id` column replaces the `username` column and holds the index of each username in the `users` dataframe that we made in the previous section. We also want the new **tweets_df** to be sorted by the `timestamp` values, with row indices that follow this order.

It can be made like this:

*Note: This may take around 35 seconds to run.*

In [16]:
# First we make a copy of the tweets dataframe and do our manipulations on that:

tweets_df = tweets.copy()

# Then we rename the tweet_id and username columns:

tweets_df.rename(columns = {'tweet_id':'identifier', 'username':'user_id'}, inplace = True)

# We do a reset_index() on users dataframe, so that its indices will have a new column "index", and we delete its
# followers_max and friends_max columns:

temp_users = users.reset_index()
del temp_users['followers_max']
del temp_users['friends_max']

# Using pd.merge(), we assign indices in users dataframe to their equivalent usernames:

merged = pd.merge(temp_users, tweets_df, left_on='username', right_on='user_id', how='left').drop('user_id', axis=1)
merged = merged.rename(columns={"index": "user_id"})
del merged['username']

# For sorting timestamps, first we convert every timestamp to an easily sortable format:

modified_timestamps = pd.to_datetime(merged['timestamp'], format = '%a %b %d %X %z %Y')

merged['modified_timestamps'] = modified_timestamps

# We sort the rows based on new timestamps:

merged = merged.sort_values(by=['modified_timestamps']).reset_index()
del merged['modified_timestamps']
del merged['index']

# We reorder the columns so that identifier and user_id come first:

cols = ['identifier'] + ['user_id'] + [col for col in merged if (col != 'identifier' and col != 'user_id')]
merged = merged[cols]
tweets_sorted = merged
tweets_df = tweets_sorted.copy()

# Finally, we delete the unnecessary columns:

del tweets_df['entities']
del tweets_df['mentions']
del tweets_df['hashtags']
del tweets_df['urls']
del merged

tweets_df

Unnamed: 0,identifier,user_id,timestamp,followers,friends,retweets,favorites,sentiment_pos,sentiment_neg
0,1255980246559408128,47593,Thu Apr 30 22:00:00 +0000 2020,26865,80,1,0,1,-3
1,1255980248728035329,14973,Thu Apr 30 22:00:00 +0000 2020,120440,69187,78,90,1,-2
2,1255980247683674113,378,Thu Apr 30 22:00:00 +0000 2020,6149624,462,32,104,1,-1
3,1255980246370676737,9854,Thu Apr 30 22:00:00 +0000 2020,200432,1880,14,50,2,-3
4,1255980246995570692,219979,Thu Apr 30 22:00:00 +0000 2020,1714,112,37,46,2,-1
...,...,...,...,...,...,...,...,...,...
1922400,1267214225270685696,793276,Sun May 31 21:59:49 +0000 2020,118,419,0,0,1,-4
1922401,1267214225283231744,756893,Sun May 31 21:59:49 +0000 2020,149,341,0,0,2,-3
1922402,1267214229469310978,323429,Sun May 31 21:59:50 +0000 2020,1465,1566,0,0,1,-1
1922403,1267214242052288520,702785,Sun May 31 21:59:53 +0000 2020,205,218,0,0,3,-1


We have used a few new functions in the above cell:
- `df.rename()` changes the labels of the specified columns to the new ones.
- `df.drop()` deletes the `user_id` column here. It can also delete rows, e.g. to delete rows 0 and 1, you can write `df.drop([0, 1])`.
- `pd.to_datetime()` converts its argument (a scalar, array-like, Series or DataFrame/dict-like) to pandas datetime objects. By specifying the format of the argument, we can get shorter runtimes when the data is large, like here.
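
To illustrate why the conversion matters for sorting, here is a minimal sketch with two made-up timestamps in the format of our data. As an assumption for this sketch, the time is spelled out as `%H:%M:%S` instead of `%X`, so that it does not depend on the locale settings. Sorted as plain strings, "Sun May 31 ..." would come before "Thu Apr 30 ..."; sorted as datetimes, the order is chronological.

```python
import pandas as pd

# Two made-up timestamps, deliberately not in chronological order
stamps = pd.Series(
    ["Sun May 31 21:32:59 +0000 2020", "Thu Apr 30 22:00:24 +0000 2020"]
)

# Parse the strings; %z reads the UTC offset (+0000)
parsed = pd.to_datetime(stamps, format="%a %b %d %H:%M:%S %z %Y")

# Sorting datetimes puts the April tweet first
ordered = parsed.sort_values()

print(ordered)
```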


### 3.9.2 entities_df

In this section, we want to create a separate dataframe for entities, in which each entity is split into its original, annotated and score values. We also want every row (entity) to have a value in its `selections` column, which shows how many times that entity has been selected by users in the whole data.

It's done like this:

In [28]:
# Creating a dataframe for our purpose:

entities_df = pd.DataFrame()

# First, we make a separate row in the data for every single entity in each entities list, using Series.explode():

res = tweets['entities'].explode()
entities_df['entities'] = res.value_counts().index

# Then we put original, annotated and score values of each entity in separate lists.
# The most frequent value is assumed to be the empty string that marks tweets without entities, so it is skipped:

original = ['']
annotated = ['']
score = ['']

for i in entities_df['entities'][1:]:      
          
    split = i.split(':')
    original.append(split[0])
    annotated.append(split[1])
    score.append(split[2])
    
# Then we make original, annotated and score columns using those lists:

entities_df['original'] = original
entities_df['annotated'] = annotated
entities_df['score'] = score

# Then we get the selections of each entity, sort the rows based on them and delete extra columns:

entities_df['selections'] = res.value_counts().reindex().to_numpy()

entities_df = entities_df.drop(entities_df.index[0]).reset_index()
del entities_df['index']
    
entities_df

Unnamed: 0,entities,original,annotated,score,selections
0,covid 19:Coronavirus_disease_2019:-1.535776454...,covid 19,Coronavirus_disease_2019,-1.535776454600282,140954
1,quarantine:Quarantine:-2.3096035868012508,quarantine,Quarantine,-2.3096035868012508,71016
2,china:China:-2.113921624336916,china,China,-2.113921624336916,57440
3,social distancing:Social_distancing:-1.4103273...,social distancing,Social_distancing,-1.4103273474020743,38509
4,ppe:Philosophy%2C_politics_and_economics:-2.48...,ppe,Philosophy%2C_politics_and_economics,-2.481280260595,16559
...,...,...,...,...,...
182410,cradle of civilization:Cradle_of_civilization:...,cradle of civilization,Cradle_of_civilization,-1.2882848955688024,1
182411,la ong fong:La-Ong-Fong:-2.6390572166801127,la ong fong,La-Ong-Fong,-2.6390572166801127,1
182412,john ward:John_Ward_%28umpire%29:-2.0444200293...,john ward,John_Ward_%28umpire%29,-2.0444200293125485,1
182413,iqbal ahmed:Iqbal_Ahmed:-2.563177904831185,iqbal ahmed,Iqbal_Ahmed,-2.563177904831185,1


As you can see, there are 182415 different entities in the whole data. Also, the rows are indexed in descending order of selections.

We have used a new function in the above cell:
- `Series.value_counts()` returns a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. It excludes NA values by default.
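As a minimal illustration of how `explode()` and `value_counts()` work together, here is a sketch on a small made-up Series. The data and the names `toy`, `exploded` and `counts` are hypothetical and not part of the tweets data:

```python
import pandas as pd

# Hypothetical toy data: each cell holds a list of entity strings.
toy = pd.Series([["covid 19", "china"], ["covid 19"], ["quarantine", "covid 19"]])

# explode() turns every list element into its own row (the original index is repeated).
exploded = toy.explode()

# value_counts() counts the unique values and sorts them in descending order of frequency.
counts = exploded.value_counts()
print(counts)
```

The most frequent value ends up first, which is why the cell above can read the entity strings from the index of the result and the selections from its values.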

### 3.9.3 mentions_df

In this section, we want to build `mentions_df` so that it lists every distinct mention together with its number of selections. The rows should be sorted and indexed based on `selections`.

In [30]:
# Creating an empty dataframe for our purpose

mentions_df = pd.DataFrame()

# Putting each single mention in a separate row:

res = tweets['mentions'].explode()
mentions_df['mentions'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

mentions_df['selections'] = res.value_counts().reindex().to_numpy()

mentions_df = mentions_df.drop(mentions_df.index[0]).reset_index()
del mentions_df['index']

mentions_df

Unnamed: 0,mentions,selections
0,realDonaldTrump,38064
1,PMOIndia,6382
2,narendramodi,6368
3,jaketapper,5928
4,YouTube,5682
...,...,...
674791,ArtsandScraps,1
674792,BluesBrewsBrats,1
674793,bekkawhite86,1
674794,GSE_Sports,1


As you can see, there are 674796 different mentions in the whole dataset.

### 3.9.4 hashtags_df

In this section, we make `hashtags_df` just like we made `mentions_df`:

In [31]:
# Creating an empty dataframe for our purpose:

hashtags_df = pd.DataFrame()

# Putting each single hashtag in a separate row:

res = tweets['hashtags'].explode()
hashtags_df['hashtags'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

hashtags_df['selections'] = res.value_counts().reindex().to_numpy()

hashtags_df = hashtags_df.drop(hashtags_df.index[0]).reset_index()
del hashtags_df['index']

hashtags_df

Unnamed: 0,hashtags,selections
0,COVID19,67655
1,coronavirus,30430
2,Covid_19,11059
3,covid19,10683
4,Covid19,9439
...,...,...
328618,ResignMayorJacobFrey,1
328619,DoubleChamberedTaxi,1
328620,noneed🙅🏾‍♂️,1
328621,uruala,1


As you can see, there are 328623 different hashtags in the whole dataset.

### 3.9.5 urls_df

We make `urls_df` just like we made `mentions_df`:

In [34]:
# Creating an empty dataframe for our purpose:

urls_df = pd.DataFrame()

# Putting each single url in a separate row:

res = tweets['urls'].explode()

urls_df['urls'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

urls_df['selections'] = res.value_counts().reindex().to_numpy()

urls_df = urls_df.drop(urls_df.index[0]).reset_index()
del urls_df['index']

urls_df

Unnamed: 0,urls,selections
0,https://www.twittascope.com/?sign=5,556
1,https://api.whatsapp.com/send?phone=9190393567...,371
2,http://rebrand.ly/work-2020,286
3,https://www.twittascope.com/?sign=6,271
4,https://redcross.give.asia/campaign/essentials...,260
...,...,...
422630,https://quoteinvestigator.com/2013/12/11/canno...,1
422631,https://www.nature.com/articles/s41591-020-091...,1
422632,https://new.brighton-hove.gov.uk/news/2020/mor...,1
422633,https://www.youtube.com/watch?v=QT_IL2fpgcs&fe...,1


### 3.9.6 tweets_entities dataframe

Now we want to create a dataframe that links `tweet_id`s and `entity_id`s. These ids are in fact the indices in `tweets_df` and `entities_df`, respectively.

Here is how it's done:

In [13]:
# First we get the entities column of the tweets_sorted dataframe that we have from tweets_df section (It will have
# the indices of the sorted tweets based on timestamps)

tweets_entities = tweets_sorted['entities'].reset_index()

# Then we make a separate row for every entity again:

tweets_entities.rename(columns = {'index':'tweet_id'}, inplace = True)
tweets_entities = tweets_entities.explode('entities')

entities_df = entities_df.reset_index()

# Deleting the rows that have no entities:

tweets_entities = tweets_entities[tweets_entities.entities != '']

# Using pd.merge(), we assign indices in entities_df to their equivalents:

merged = pd.merge(entities_df, tweets_entities, left_on='entities', right_on='entities', how='right')

# Index is now the entity_id:

merged.rename(columns = {'index':'entity_id'}, inplace = True)

# Deleting the unnecessary columns and finalizing the dataframe:

del merged['entities']
del merged['original']
del merged['annotated']
del merged['score']
del merged['selections']

cols = ['tweet_id'] + ['entity_id']
merged = merged[cols]
tweets_entities = merged.copy()

del entities_df['index']

tweets_entities

Unnamed: 0,tweet_id,entity_id
0,0,16611
1,0,6716
2,0,383
3,1,9942
4,2,9554
...,...,...
2757735,1912060,16
2757736,1912061,851
2757737,1912061,1752
2757738,1912061,130908


### 3.9.7 tweets_hashtags dataframe

In [14]:
# First we get the hashtags column of the tweets_sorted dataframe that we have from tweets_df section (It will have
# the indices of the sorted tweets based on timestamps)

tweets_hashtags = tweets_sorted['hashtags'].reset_index()

# Then we make a separate row for every hashtag again:

tweets_hashtags.rename(columns = {'index':'tweet_id'}, inplace = True)
tweets_hashtags = tweets_hashtags.explode('hashtags')

hashtags_df = hashtags_df.reset_index()

# Deleting the rows that have no hashtags:

tweets_hashtags = tweets_hashtags[tweets_hashtags.hashtags != '']

# Using pd.merge(), we assign indices in hashtags_df to their equivalents:

merged = pd.merge(hashtags_df, tweets_hashtags, left_on='hashtags', right_on='hashtags', how='right')

# Index is now the hashtag_id:

merged.rename(columns = {'index':'hashtag_id'}, inplace = True)
del merged['hashtags']
del merged['selections']

# Deleting the unnecessary columns and finalizing the dataframe:

cols = ['tweet_id'] + ['hashtag_id']
merged = merged[cols]
tweets_hashtags = merged.copy()

del hashtags_df['index']

tweets_hashtags

Unnamed: 0,tweet_id,hashtag_id
0,0,438
1,1,179308
2,1,8943
3,1,14935
4,1,10668
...,...,...
1551213,1912035,241
1551214,1912035,4688
1551215,1912047,9525
1551216,1912051,674


### 3.9.8 tweets_mentions dataframe

In [15]:
# First we get the mentions column of the tweets_sorted dataframe that we have from tweets_df section (It will have
# the indices of the sorted tweets based on timestamps)

tweets_mentions = tweets_sorted['mentions'].reset_index()

# Then we make a separate row for every mention again:

tweets_mentions.rename(columns = {'index':'tweet_id'}, inplace = True)
tweets_mentions = tweets_mentions.explode('mentions')

mentions_df = mentions_df.reset_index()

# Deleting the rows that have no mentions:

tweets_mentions = tweets_mentions[tweets_mentions.mentions != '']

# Using pd.merge(), we assign indices in mentions_df to their equivalents:

merged = pd.merge(mentions_df, tweets_mentions, left_on='mentions', right_on='mentions', how='right')

# Index is now the mention_id:

merged.rename(columns = {'index':'mention_id'}, inplace = True)

# Deleting the unnecessary columns and finalizing the dataframe:

del merged['mentions']
del merged['selections']

cols = ['tweet_id'] + ['mention_id']
merged = merged[cols]
tweets_mentions = merged.copy()

del mentions_df['index']

tweets_mentions

Unnamed: 0,tweet_id,mention_id
0,3,29152
1,8,108796
2,9,15649
3,13,440995
4,17,435743
...,...,...
2009441,1912060,90696
2009442,1912061,392590
2009443,1912061,392589
2009444,1912061,42533


### 3.9.9 tweets_urls dataframe

In [16]:
# First we get the urls column of the tweets_sorted dataframe that we have from tweets_df section (It will have
# the indices of the sorted tweets based on timestamps)

tweets_urls = tweets_sorted['urls'].reset_index()

# Then we make a separate row for every url:

tweets_urls.rename(columns = {'index':'tweet_id'}, inplace = True)
tweets_urls = tweets_urls.explode('urls')

urls_df = urls_df.reset_index()

# Deleting the rows that have no urls:

tweets_urls = tweets_urls[tweets_urls.urls != '']

# Using pd.merge(), we assign indices in urls_df to their equivalents:

merged = pd.merge(urls_df, tweets_urls, left_on='urls', right_on='urls', how='right')

# Index is now the url_id:

merged.rename(columns = {'index':'url_id'}, inplace = True)

# Deleting the unnecessary columns and finalizing the dataframe:

del merged['urls']
del merged['selections']

cols = ['tweet_id'] + ['url_id']
merged = merged[cols]
tweets_urls = merged.copy()

del urls_df['index']

tweets_urls

Unnamed: 0,tweet_id,url_id
0,0,64962
1,2,11746
2,3,336798
3,4,261083
4,5,44481
...,...,...
534002,1912002,11972
534003,1912010,390206
534004,1912017,8366
534005,1912028,73475


# 4. Data handling with Pandas

In this section, we will work on some real data from Twitter. It is taken from the [TweetsCOV19 dataset](https://data.gesis.org/tweetscov19/), a semantically annotated corpus of tweets about the COVID-19 pandemic.

Download the file https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz, and after extraction, store it in the ./data directory.

It's a tweets dataset containing tab-separated values (TSV). For TSV or CSV files, we use the `pandas.read_csv()` function to read the data. For complete information on the function, you can check [this link](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).



In [1]:
import pandas as pd

tweets = pd.read_csv('./data/data', sep='\t', header=None, quoting= 3)

In [2]:
tweets

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
...,...,...,...,...,...,...,...,...,...,...,...,...
1922400,1267207472424660992,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,Sun May 31 21:32:59 +0000 2020,15,45,0,0,spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi...,2 -1,null;,null;,null;
1922401,1267207883487354881,0e4323d01d164b9eb6e33f35564c7e25,Sun May 31 21:34:37 +0000 2020,43,931,0,0,china:China:-2.113921624336916;death penalty:C...,1 -2,null;,null;,null;
1922402,1267209309559173122,00fc2c96e4012e27a6eee351723ab461,Sun May 31 21:40:17 +0000 2020,256,451,0,0,null;,2 -1,null;,null;,null;
1922403,1267212987938545667,0f99a3b8b0d490f062215575d074518b,Sun May 31 21:54:54 +0000 2020,1467,1505,0,0,omg:OMG_%28Usher_song%29:-2.580063760606172;,2 -1,lsddrq,null;,null;


To add the right labels to the columns, we do the following:

In [3]:
tweets.columns = ['tweet_id', 'username', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'entities', 'sentiment', 'mentions', 'hashtags', 'urls']
tweets

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,sentiment,mentions,hashtags,urls
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
...,...,...,...,...,...,...,...,...,...,...,...,...
1922400,1267207472424660992,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,Sun May 31 21:32:59 +0000 2020,15,45,0,0,spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi...,2 -1,null;,null;,null;
1922401,1267207883487354881,0e4323d01d164b9eb6e33f35564c7e25,Sun May 31 21:34:37 +0000 2020,43,931,0,0,china:China:-2.113921624336916;death penalty:C...,1 -2,null;,null;,null;
1922402,1267209309559173122,00fc2c96e4012e27a6eee351723ab461,Sun May 31 21:40:17 +0000 2020,256,451,0,0,null;,2 -1,null;,null;,null;
1922403,1267212987938545667,0f99a3b8b0d490f062215575d074518b,Sun May 31 21:54:54 +0000 2020,1467,1505,0,0,omg:OMG_%28Usher_song%29:-2.580063760606172;,2 -1,lsddrq,null;,null;


As you can see, all 12 columns of the dataframe now have labels.
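The labels can also be assigned while reading, through the `names` parameter of `pandas.read_csv()`. The following is only a sketch on a made-up, shortened sample: the string `sample` and its three columns are hypothetical, and `csv.QUOTE_NONE` is the named constant behind `quoting=3`:

```python
import csv
import io

import pandas as pd

# Hypothetical two-line sample in the same tab-separated layout (shortened to three columns).
sample = "1\tuser_a\tnull;\n2\tuser_b\tCOVID19 lockdown\n"

# names= labels the columns while reading; QUOTE_NONE tells the parser not to treat quotes specially.
toy = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["tweet_id", "username", "hashtags"],
    quoting=csv.QUOTE_NONE,
)
print(toy)
```

Note that the placeholder `null;` is read as an ordinary string, not as a missing value, which is why the later cells compare against `'null;'` explicitly.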

Here is some more information on the dataframe columns:

- Tweet Id: Long.
- Username: String. Encrypted for privacy issues.
- Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
- #Followers: Integer.
- #Friends: Integer.
- #Retweets: Integer.
- #Favorites: Integer.
- Entities: String. For each entity, we aggregated the original text, the annotated entity and the produced score from FEL library. Each entity is separated from another entity by char ";". Also, each entity is separated by char ":" in order to store "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
- Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We split these two numbers by whitespace char " ". Positive sentiment was stored first and then negative sentiment (i.e. "2 -1").
- Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".
- Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".
- URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".
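To make the `entities` layout concrete, here is a small sketch that parses one such cell in plain Python. The function name `parse_entities` and the sample string are hypothetical, and the sketch assumes that neither the original text nor the annotated entity contains a `:` character:

```python
# Hypothetical entities cell following the "original_text:annotated_entity:score;" layout.
cell = "i hate u:I_Hate_U:-1.87;quarantine:Quarantine:-2.30;"

def parse_entities(cell):
    """Return a list of (original, annotated, score) tuples; 'null;' yields an empty list."""
    if cell == "null;":
        return []
    parsed = []
    # Every entity ends with ';', so the last piece after splitting is empty and is skipped.
    for item in cell.split(";"):
        if item:
            original, annotated, score = item.split(":")
            parsed.append((original, annotated, float(score)))
    return parsed

print(parse_entities(cell))
```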


### Splitting `sentiment` values

Let's say we want to split the negative and positive values in the `sentiment` column, add each in a new column (`sentiment_pos` & `sentiment_neg`) and remove the `sentiment` column. We can do it like this:

In [4]:
# Putting the pos/neg values in the lists:

pos = []
neg = []

for i in tweets['sentiment']:
    pos.append(i.split()[0])
    neg.append(i.split()[1])

# Making the new columns from the lists:

tweets['sentiment_pos'] = pos
tweets['sentiment_neg'] = neg

# Deleting the sentiment column:

del tweets['sentiment']

In [5]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,null;,Opinion Next2blowafrica thoughts,null;,1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,null;,null;,null;,1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,null;,null;,null;,1,-4


As you can see, the `sentiment_pos` and `sentiment_neg` columns have been added to the end of the dataframe and `sentiment` is deleted.
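The loop above stores the two values as strings. As a possible alternative, the split can be done without a loop using pandas string methods, converting the result to integers along the way. This is a sketch on hypothetical toy data, not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical toy column in the same "pos neg" layout as `sentiment`.
toy = pd.DataFrame({"sentiment": ["1 -1", "2 -1", "1 -4"]})

# str.split(expand=True) returns one column per piece; astype(int) converts the strings to integers.
parts = toy["sentiment"].str.split(expand=True).astype(int)
toy["sentiment_pos"] = parts[0]
toy["sentiment_neg"] = parts[1]

# drop() removes the original column once it is no longer needed.
toy = toy.drop(columns="sentiment")
print(toy)
```

With integer columns, numerical operations such as `toy["sentiment_pos"].mean()` work directly.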

### Putting `entities`, `mentions`, `hashtags` and `urls` values into lists

Let's say we want to put the values of each of the four above-mentioned columns into lists. We are going to need this in our further steps of data manipulation.

In order to do that, we can define the right functions and then `apply()` them to the columns:

In [6]:
# function for putting hashtags and mentions values into lists:

def f1(cell):
    if cell == 'null;' or type(cell) == float:
        cell = ['']
    else:
        cell = cell.split()
    return cell

In [7]:
tweets['hashtags'] = tweets['hashtags'].apply(f1)

In [8]:
tweets['mentions'] = tweets['mentions'].apply(f1)

In [9]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,[],"[Opinion, Next2blowafrica, thoughts]",null;,1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,[],[],null;,1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,[],[],https://www.bbc.com/news/uk-england-beds-bucks...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,[],[],https://lockdownsceptics.org/2020/04/30/latest...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,[],[],null;,1,-4


Since the values in the `hashtags` and `mentions` columns are separated by space characters, we split them on whitespace. The values in `entities` and `urls` are separated by `;` and `:-:`, respectively, so we need to change the function accordingly:

In [10]:
def f2(cell, column):
    if cell == 'null;' or type(cell) == float:
        cell = ['']
    else:
        if column == 'entities':
            splitted = cell.split(';')
        elif column == 'urls':
            splitted = cell.split(':-:')
        # The last piece after the final separator is not an item, so we remove it:
        del splitted[-1]
        cell = splitted
        
    return cell


# applying f2 to entities and urls columns:

tweets['entities'] = tweets['entities'].apply(f2, column = 'entities')
tweets['urls'] = tweets['urls'].apply(f2, column = 'urls')

In [11]:
tweets.head(5)

Unnamed: 0,tweet_id,username,timestamp,followers,friends,retweets,favorites,entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,[],[],"[Opinion, Next2blowafrica, thoughts]",[],1,-1
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,[],[],[],[],1,-1
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,[],[],[],[https://www.bbc.com/news/uk-england-beds-buck...,2,-1
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,[],[],[],[https://lockdownsceptics.org/2020/04/30/lates...,1,-1
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,"[i hate u:I_Hate_U:-1.8786140035817729, quaran...",[],[],[],1,-4


For more information on the `apply()` function, you can check its official documentation [here](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html).
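Extra keyword arguments given to `apply()` are passed on to the applied function, so one function with a separator parameter could cover all four columns. The sketch below uses a hypothetical function `split_cell` and made-up data. Unlike `f1` and `f2`, it returns an empty list instead of `['']` for cells without values:

```python
import pandas as pd

def split_cell(cell, sep=None):
    """Split a string cell on `sep`; 'null;' and missing values become an empty list."""
    if cell == "null;" or not isinstance(cell, str):
        return []
    # Drop empty pieces that a trailing separator leaves behind.
    return [piece.strip() for piece in cell.split(sep) if piece.strip()]

# Hypothetical toy columns in the layouts described above.
toy = pd.DataFrame({
    "hashtags": ["COVID19 lockdown", "null;"],
    "urls": ["https://a.example/:-:https://b.example/:-:", "null;"],
})

# sep=None splits on whitespace; sep=":-:" is forwarded by apply() to split_cell.
toy["hashtags"] = toy["hashtags"].apply(split_cell)
toy["urls"] = toy["urls"].apply(split_cell, sep=":-:")
print(toy)
```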

### Making `users` dataframe

Now let's say we want to make a dataframe out of `tweets` that gives us some useful information about the users in our main dataframe.

We want this dataframe to contain all usernames in one column, the maximum number of followers of each username (in the whole data) in a second column, and the maximum number of friends of each username (in the whole data) in a third column. Finally, we want all these usernames sorted by the maximum number of followers.

It could be done like this:

*Note: This may take around 90 seconds to run.*

In [12]:
# Getting the maximum number of followers for each user:

followers_max = tweets.loc[tweets.groupby('username')['followers'].idxmax()]
followers_max = followers_max.reset_index()

# Getting the maximum number of friends for each user:

friends_max = tweets.loc[tweets.groupby('username')['friends'].idxmax()]
friends_max = friends_max.reset_index()

# Putting them all in the new dataframe:

users = pd.DataFrame()
users['username'] = friends_max['username']
users['followers_max'] = followers_max['followers']
users['friends_max'] = friends_max['friends']

# Sorting the rows in descending order, based on followers_max:

users = users.sort_values('followers_max', ascending = False).reset_index()

# Deleting the index columns, since we have reset the indices:

del users['index']

users

Unnamed: 0,username,followers_max,friends_max
0,c1d4d177b4028f2b6ea90a3617c32fb6,117926717,606040
1,0b64e075d55e5221457d3e22ba3dcc14,111636059,299530
2,7cd534d396546a50ddd2dea9ee7f9145,108555597,224
3,a075253a703c963c96f819be90e82a67,81495144,123005
4,75224fc65ae453fe9ec3ca855cd8619b,80751709,46
...,...,...,...
1122828,5713bd926a702800a5b24131d42fdc9b,0,38
1122829,3cab3ac9ba90bdce7bef2ffb17820fa9,0,28
1122830,b19b8939c414154978b1b996b27fc0c6,0,103
1122831,3caad3ca8cc57b5eb5321530cc8fd4b8,0,25


As you can see, the `users` dataframe has hundreds of thousands fewer rows than `tweets`. This is because in the `tweets` dataframe every username can have many tweets, whereas in `users` each username occurs only once. So, the number of rows in the `users` dataframe is in fact the number of users who have tweeted in the `tweets` data.

Since the usernames are encrypted, we cannot look up the accounts directly, but judging by the follower counts in [this link](https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts#:~:text=This%20list%20contains%20the%20top,over%20100%20million%20followers%20each.), the user with the highest number of followers is probably Barack Obama!

For more information on the [groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html), [idxmax()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) and [sort_values()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) functions and their possible arguments, you can click on the links.
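If only the maxima themselves are needed, a shorter route is to aggregate the groups directly. The following sketch uses pandas named aggregation on hypothetical toy data. It is an alternative to the `idxmax()` approach above, not the method used in this notebook:

```python
import pandas as pd

# Hypothetical toy data: two users with several tweets each.
toy = pd.DataFrame({
    "username": ["a", "a", "b", "b", "b"],
    "followers": [10, 12, 500, 480, 490],
    "friends": [5, 7, 20, 25, 22],
})

# Named aggregation computes both maxima per username in one step.
toy_users = (
    toy.groupby("username")
    .agg(followers_max=("followers", "max"), friends_max=("friends", "max"))
    .sort_values("followers_max", ascending=False)
    .reset_index()
)
print(toy_users)
```

Here `reset_index()` turns the group key `username` back into an ordinary column, so the row index again runs from 0 in descending order of `followers_max`.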

### Making `tweets_df`

We want to make a dataframe called **tweets_df** that contains the columns `identifier`, `timestamp`, `followers`, `friends`, `retweets`, `favorites`, `sentiment_pos` and `sentiment_neg`, together with a new column `user_id`. This column replaces `username` and holds the index of each username in the `users` dataframe that we made in the previous section. We also want **tweets_df** to be sorted and indexed by the `timestamp` values.

It would be made like this:

*Note: This may take around 35 seconds to run.*

In [13]:
# First, we get a copy of the tweets dataframe and do our manipulations on that:

tweets_df = tweets.copy()

# Then we rename the tweet_id and username columns:

tweets_df.rename(columns = {'tweet_id':'identifier', 'username':'user_id'}, inplace = True)

# We do a reset_index() on users dataframe, so that its indices will have a new column "index", and we delete its
# followers_max and friends_max columns:

temp_users = users.reset_index()
del temp_users['followers_max']
del temp_users['friends_max']

# Using pd.merge(), we assign indices in users dataframe to their equivalent usernames:

merged = pd.merge(temp_users, tweets_df, left_on='username', right_on='user_id', how='left').drop('user_id', axis=1)
merged = merged.rename(columns={"index": "user_id"})
del merged['username']

# For sorting timestamps, first we convert every timestamp to an easily sortable format:

modified_timestamps = pd.to_datetime(merged['timestamp'], format = '%a %b %d %X %z %Y')

merged['modified_timestamps'] = modified_timestamps

# We sort the rows based on new timestamps:

merged = merged.sort_values(by=['modified_timestamps']).reset_index()
del merged['modified_timestamps']
del merged['index']

# We reorder the columns so that identifier and user_id come first:

cols = ['identifier'] + ['user_id'] + [col for col in merged if (col != 'identifier' and col != 'user_id')]
merged = merged[cols]
tweets_sorted = merged
tweets_df = tweets_sorted.copy()

# Finally, we delete the unnecessary columns:

del tweets_df['entities']
del tweets_df['mentions']
del tweets_df['hashtags']
del tweets_df['urls']
del merged

tweets_df

Unnamed: 0,identifier,user_id,timestamp,followers,friends,retweets,favorites,sentiment_pos,sentiment_neg
0,1255980246559408128,47593,Thu Apr 30 22:00:00 +0000 2020,26865,80,1,0,1,-3
1,1255980248728035329,14973,Thu Apr 30 22:00:00 +0000 2020,120440,69187,78,90,1,-2
2,1255980247683674113,378,Thu Apr 30 22:00:00 +0000 2020,6149624,462,32,104,1,-1
3,1255980246370676737,9854,Thu Apr 30 22:00:00 +0000 2020,200432,1880,14,50,2,-3
4,1255980246995570692,219979,Thu Apr 30 22:00:00 +0000 2020,1714,112,37,46,2,-1
...,...,...,...,...,...,...,...,...,...
1922400,1267214225270685696,793276,Sun May 31 21:59:49 +0000 2020,118,419,0,0,1,-4
1922401,1267214225283231744,756893,Sun May 31 21:59:49 +0000 2020,149,341,0,0,2,-3
1922402,1267214229469310978,323429,Sun May 31 21:59:50 +0000 2020,1465,1566,0,0,1,-1
1922403,1267214242052288520,702785,Sun May 31 21:59:53 +0000 2020,205,218,0,0,3,-1


We have used a few new functions in the above cell. You can access the complete information on them from the official pandas documentation:
- [df.copy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html)
- [df.rename()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)
- [pd.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
- [pd.to_datetime()](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
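The two central steps, looking up an id with `pd.merge()` and sorting chronologically with `pd.to_datetime()`, can be sketched on a few made-up rows. All names and values below are hypothetical, and the format string spells out the time as `%H:%M:%S` instead of the locale-dependent `%X` used above:

```python
import pandas as pd

# Hypothetical toy tweets, deliberately out of chronological order.
toy = pd.DataFrame({
    "username": ["b", "a", "b"],
    "timestamp": [
        "Sun May 31 21:54:54 +0000 2020",
        "Thu Apr 30 22:00:24 +0000 2020",
        "Thu Apr 30 22:03:52 +0000 2020",
    ],
})
# Hypothetical users table; its index plays the role of user_id.
toy_users = pd.DataFrame({"username": ["a", "b"]})

# merge() attaches the users index (as a column) to every tweet of that user.
ids = toy_users.reset_index().rename(columns={"index": "user_id"})
linked = pd.merge(toy, ids, on="username", how="left")

# to_datetime() parses the strings so that sorting is chronological, not alphabetical.
linked["parsed"] = pd.to_datetime(linked["timestamp"], format="%a %b %d %H:%M:%S %z %Y")
linked = linked.sort_values("parsed").reset_index(drop=True)
print(linked[["user_id", "timestamp"]])
```

Sorting the raw strings would order the rows by weekday name, which is why the timestamps are parsed first.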

### Making `entities_df`

In this sub-section, we want to create a separate dataframe for `entities`, in which each entity is split into its original, annotated and score values. We also want every row (entity) to have a value in its `selections` column, which shows how many times that entity has been selected by the users in the whole data.

It's done like this:

In [14]:
# Creating a dataframe for our purpose:

entities_df = pd.DataFrame()

# First, we make a separate row in the data for every single entity in each entities list, using Series.explode():

res = tweets['entities'].explode()
entities_df['entities'] = res.value_counts().index

# Then we put original, annotated and score values of each entity in separate lists:

original = ['']
annotated = ['']
score = ['']

for i in entities_df['entities'][1:]:      
          
    split = i.split(':')
    original.append(split[0])
    annotated.append(split[1])
    score.append(split[2])
    
# Then we make original, annotated and score columns using those lists:

entities_df['original'] = original
entities_df['annotated'] = annotated
entities_df['score'] = score

# Then we get the selections of each entity, sort the rows based on them and delete extra columns:

entities_df['selections'] = res.value_counts().reindex().to_numpy()

entities_df = entities_df.drop(entities_df.index[0]).reset_index()
del entities_df['index']
    
entities_df

Unnamed: 0,entities,original,annotated,score,selections
0,covid 19:Coronavirus_disease_2019:-1.535776454...,covid 19,Coronavirus_disease_2019,-1.535776454600282,140954
1,quarantine:Quarantine:-2.3096035868012508,quarantine,Quarantine,-2.3096035868012508,71016
2,china:China:-2.113921624336916,china,China,-2.113921624336916,57440
3,social distancing:Social_distancing:-1.4103273...,social distancing,Social_distancing,-1.4103273474020743,38509
4,ppe:Philosophy%2C_politics_and_economics:-2.48...,ppe,Philosophy%2C_politics_and_economics,-2.481280260595,16559
...,...,...,...,...,...
182410,cradle of civilization:Cradle_of_civilization:...,cradle of civilization,Cradle_of_civilization,-1.2882848955688024,1
182411,la ong fong:La-Ong-Fong:-2.6390572166801127,la ong fong,La-Ong-Fong,-2.6390572166801127,1
182412,john ward:John_Ward_%28umpire%29:-2.0444200293...,john ward,John_Ward_%28umpire%29,-2.0444200293125485,1
182413,iqbal ahmed:Iqbal_Ahmed:-2.563177904831185,iqbal ahmed,Iqbal_Ahmed,-2.563177904831185,1


As you can see, there are 182415 different entities in the whole data. Also, the rows are indexed in descending order of selections.

Also, note that each entity string consists of its original, annotated and score values separated by `:` characters, so we split it on that character.

The new functions used in the above cell:
- [df.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html)
- [Series.value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
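The counting and splitting steps can also be written without an explicit loop. The sketch below works on a hypothetical Series of already exploded entity strings and assumes that every string contains exactly two `:` characters:

```python
import pandas as pd

# Hypothetical exploded entity strings in the "original:annotated:score" layout.
toy = pd.Series([
    "china:China:-2.11",
    "quarantine:Quarantine:-2.31",
    "china:China:-2.11",
])

# Count each unique entity string, then turn the result into a two-column dataframe.
counts = toy.value_counts()
toy_entities = pd.DataFrame({"entities": counts.index, "selections": counts.to_numpy()})

# str.split(expand=True) yields one column per ':'-separated part.
parts = toy_entities["entities"].str.split(":", expand=True)
toy_entities["original"] = parts[0]
toy_entities["annotated"] = parts[1]
toy_entities["score"] = parts[2].astype(float)
print(toy_entities)
```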

### Making `mentions_df`

Now we want to build `mentions_df` so that it lists every distinct mention together with its number of selections. The rows should be sorted and indexed based on `selections`.

In [15]:
# Creating an empty dataframe for our purpose

mentions_df = pd.DataFrame()

# Putting each single mention in a separate row:

res = tweets['mentions'].explode()
mentions_df['mentions'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

mentions_df['selections'] = res.value_counts().reindex().to_numpy()

mentions_df = mentions_df.drop(mentions_df.index[0]).reset_index()
del mentions_df['index']

mentions_df

Unnamed: 0,mentions,selections
0,realDonaldTrump,38064
1,PMOIndia,6382
2,narendramodi,6368
3,jaketapper,5928
4,YouTube,5682
...,...,...
674791,ArtsandScraps,1
674792,BluesBrewsBrats,1
674793,bekkawhite86,1
674794,GSE_Sports,1


As you can see, there are 674796 different mentions in the whole dataset.
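The cell above drops the first row on the assumption that the empty placeholder `''` is the most frequent value. A variant that filters the placeholder out explicitly could look like the following sketch. The toy data and the names `toy`, `exploded` and `toy_mentions` are hypothetical:

```python
import pandas as pd

# Hypothetical toy column of mention lists; [''] marks a tweet without mentions.
toy = pd.Series([["YouTube"], [""], ["YouTube", "PMOIndia"], [""]])

exploded = toy.explode()

# Filter out the empty placeholder explicitly instead of dropping the first row.
exploded = exploded[exploded != ""]

# rename_axis() names the index so that reset_index() produces a labelled column.
toy_mentions = (
    exploded.value_counts()
    .rename_axis("mentions")
    .reset_index(name="selections")
)
print(toy_mentions)
```

This way the result does not depend on whether tweets without mentions happen to outnumber the most popular mention.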

### Making `hashtags_df`

We make `hashtags_df` just like how we made `mentions_df`:

In [16]:
# Creating an empty dataframe for our purpose:

hashtags_df = pd.DataFrame()

# Putting each single hashtag in a separate row:

res = tweets['hashtags'].explode()
hashtags_df['hashtags'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

hashtags_df['selections'] = res.value_counts().reindex().to_numpy()

hashtags_df = hashtags_df.drop(hashtags_df.index[0]).reset_index()
del hashtags_df['index']

hashtags_df

Unnamed: 0,hashtags,selections
0,COVID19,67655
1,coronavirus,30430
2,Covid_19,11059
3,covid19,10683
4,Covid19,9439
...,...,...
328618,ResignMayorJacobFrey,1
328619,DoubleChamberedTaxi,1
328620,noneed🙅🏾‍♂️,1
328621,uruala,1


As you can see, there are 328623 different hashtags in the whole dataset.

### Making `urls_df`

We make the `urls_df` similarly:

In [17]:
# Creating an empty dataframe for our purpose:

urls_df = pd.DataFrame()

# Putting each single url in a separate row:

res = tweets['urls'].explode()

urls_df['urls'] = res.value_counts().index

# Getting the selection numbers, sorting the data based on that, and deleting the extra columns:

urls_df['selections'] = res.value_counts().reindex().to_numpy()

urls_df = urls_df.drop(urls_df.index[0]).reset_index()
del urls_df['index']

urls_df

Unnamed: 0,urls,selections
0,https://www.twittascope.com/?sign=5,556
1,https://api.whatsapp.com/send?phone=9190393567...,371
2,http://rebrand.ly/work-2020,286
3,https://www.twittascope.com/?sign=6,271
4,https://redcross.give.asia/campaign/essentials...,260
...,...,...
422630,https://quoteinvestigator.com/2013/12/11/canno...,1
422631,https://www.nature.com/articles/s41591-020-091...,1
422632,https://new.brighton-hove.gov.uk/news/2020/mor...,1
422633,https://www.youtube.com/watch?v=QT_IL2fpgcs&fe...,1


### Making `tweets_entities` dataframe

Now we want to create a dataframe that links **tweet_id**s and **entity_id**s. These ids are in fact the indices in `tweets_df` and `entities_df`, respectively.

Here is how it's done:

In [18]:
# First we get the entities column of the tweets_sorted dataframe from the tweets_df section.
# Its index holds the indices of the tweets sorted by timestamp, so we turn it into a tweet_id column:

tweets_entities = tweets_sorted['entities'].reset_index()
tweets_entities = tweets_entities.rename(columns={'index': 'tweet_id'})

# Then we make a separate row for every entity again:

tweets_entities = tweets_entities.explode('entities')

# Deleting the rows that have no entities:

tweets_entities = tweets_entities[tweets_entities['entities'] != '']

# Turning the index of entities_df into a column, so that it is available after the merge:

entities_df = entities_df.reset_index()

# Using pd.merge(), we assign the indices in entities_df to the matching entities:

merged = pd.merge(entities_df, tweets_entities, on='entities', how='right')

# The former index is now the entity_id:

merged = merged.rename(columns={'index': 'entity_id'})

# Keeping only the two id columns and finalizing the dataframe:

tweets_entities = merged[['tweet_id', 'entity_id']].copy()

# Removing the helper column from entities_df again:

del entities_df['index']

tweets_entities

Unnamed: 0,tweet_id,entity_id
0,0,18
1,0,32725
2,0,22847
3,3,17104
4,3,6782
...,...,...
2772153,1922402,16
2772154,1922403,859
2772155,1922403,1758
2772156,1922403,131523
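
To see what the merge does, here is a minimal sketch on made-up data. All names (`toy_entities`, `toy_pairs`, `toy_merged`, `toy_links`) and values are hypothetical. The sketch also illustrates a general property of `how='right'`: a label that is missing from the lookup table keeps its row but gets a missing id, which turns the id column into floats. The final `dropna` and integer cast are our own additions for that case, not part of the cell above.

```python
import pandas as pd

# Hypothetical lookup table: the row index plays the role of entity_id
toy_entities = pd.DataFrame({'entities': ['virus', 'mask', 'vaccine']})

# Hypothetical exploded tweet-entity pairs; 'lockdown' is not in the lookup table
toy_pairs = pd.DataFrame({
    'tweet_id': [0, 0, 1, 2],
    'entities': ['mask', 'virus', 'vaccine', 'lockdown'],
})

# Right join: every row of toy_pairs is kept and receives the index of its entity
toy_merged = pd.merge(toy_entities.reset_index(), toy_pairs, on='entities', how='right')
toy_merged = toy_merged.rename(columns={'index': 'entity_id'})

# Labels without a match have a missing entity_id
n_unmatched = int(toy_merged['entity_id'].isna().sum())

# Drop unmatched rows and restore integer ids
toy_links = toy_merged.dropna(subset=['entity_id']).astype({'entity_id': int})
toy_links = toy_links[['tweet_id', 'entity_id']]
print(n_unmatched)
print(toy_links)
```

If every entity occurs in the lookup table, as the integer ids in the output above suggest for our dataset, the extra cleaning step is not needed.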


### Making `tweets_hashtags` dataframe

Similarly, we want to create a dataframe that links **tweet_id**s and **hashtag_id**s. These ids are in fact the indices in `tweets_df` and `hashtags_df`, respectively.

Here is how it's done:

In [19]:
# First we get the hashtags column of the tweets_sorted dataframe from the tweets_df section.
# Its index holds the indices of the tweets sorted by timestamp, so we turn it into a tweet_id column:

tweets_hashtags = tweets_sorted['hashtags'].reset_index()
tweets_hashtags = tweets_hashtags.rename(columns={'index': 'tweet_id'})

# Then we make a separate row for every hashtag again:

tweets_hashtags = tweets_hashtags.explode('hashtags')

# Deleting the rows that have no hashtags:

tweets_hashtags = tweets_hashtags[tweets_hashtags['hashtags'] != '']

# Turning the index of hashtags_df into a column, so that it is available after the merge:

hashtags_df = hashtags_df.reset_index()

# Using pd.merge(), we assign the indices in hashtags_df to the matching hashtags:

merged = pd.merge(hashtags_df, tweets_hashtags, on='hashtags', how='right')

# The former index is now the hashtag_id:

merged = merged.rename(columns={'index': 'hashtag_id'})

# Keeping only the two id columns and finalizing the dataframe:

tweets_hashtags = merged[['tweet_id', 'hashtag_id']].copy()

# Removing the helper column from hashtags_df again:

del hashtags_df['index']

tweets_hashtags

Unnamed: 0,tweet_id,hashtag_id
0,1,8313
1,2,276894
2,2,0
3,2,7496
4,3,413
...,...,...
1477991,1922376,218
1477992,1922376,4664
1477993,1922389,9414
1477994,1922393,643


### Making `tweets_mentions` dataframe

Again, we create a dataframe that links **tweet_id**s and **mention_id**s. These ids are in fact the indices in `tweets_df` and `mentions_df`, respectively.

Here is how it's done:

In [20]:
# First we get the mentions column of the tweets_sorted dataframe from the tweets_df section.
# Its index holds the indices of the tweets sorted by timestamp, so we turn it into a tweet_id column:

tweets_mentions = tweets_sorted['mentions'].reset_index()
tweets_mentions = tweets_mentions.rename(columns={'index': 'tweet_id'})

# Then we make a separate row for every mention again:

tweets_mentions = tweets_mentions.explode('mentions')

# Deleting the rows that have no mentions:

tweets_mentions = tweets_mentions[tweets_mentions['mentions'] != '']

# Turning the index of mentions_df into a column, so that it is available after the merge:

mentions_df = mentions_df.reset_index()

# Using pd.merge(), we assign the indices in mentions_df to the matching mentions:

merged = pd.merge(mentions_df, tweets_mentions, on='mentions', how='right')

# The former index is now the mention_id:

merged = merged.rename(columns={'index': 'mention_id'})

# Keeping only the two id columns and finalizing the dataframe:

tweets_mentions = merged[['tweet_id', 'mention_id']].copy()

# Removing the helper column from mentions_df again:

del mentions_df['index']

tweets_mentions

Unnamed: 0,tweet_id,mention_id
0,2,29788
1,6,95123
2,7,16411
3,15,487099
4,17,502507
...,...,...
2001857,1922402,84209
2001858,1922403,393991
2001859,1922403,393992
2001860,1922403,47294


### Making `tweets_urls` dataframe

Finally, we create a dataframe that links **tweet_id**s and **url_id**s. These ids are in fact the indices in `tweets_df` and `urls_df`, respectively.

Here is how it's done:

In [21]:
# First we get the urls column of the tweets_sorted dataframe from the tweets_df section.
# Its index holds the indices of the tweets sorted by timestamp, so we turn it into a tweet_id column:

tweets_urls = tweets_sorted['urls'].reset_index()
tweets_urls = tweets_urls.rename(columns={'index': 'tweet_id'})

# Then we make a separate row for every url:

tweets_urls = tweets_urls.explode('urls')

# Deleting the rows that have no urls:

tweets_urls = tweets_urls[tweets_urls['urls'] != '']

# Turning the index of urls_df into a column, so that it is available after the merge:

urls_df = urls_df.reset_index()

# Using pd.merge(), we assign the indices in urls_df to the matching urls:

merged = pd.merge(urls_df, tweets_urls, on='urls', how='right')

# The former index is now the url_id:

merged = merged.rename(columns={'index': 'url_id'})

# Keeping only the two id columns and finalizing the dataframe:

tweets_urls = merged[['tweet_id', 'url_id']].copy()

# Removing the helper column from urls_df again:

del urls_df['index']

tweets_urls

Unnamed: 0,tweet_id,url_id
0,0,156187
1,1,264087
2,2,339826
3,3,113001
4,5,11287
...,...,...
536905,1922343,14716
536906,1922351,261036
536907,1922358,8012
536908,1922369,77641
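
The four linking cells above follow the same pattern, so they could be wrapped in one function. The following is a minimal sketch under that assumption, not the way the dataframes above were actually built: `make_link_table` and its parameters are hypothetical names, and the toy data is made up. It assumes, as in the cells above, that tweets without a value carry an empty string as a placeholder.

```python
import pandas as pd

def make_link_table(tweets_sorted, lookup_df, column, id_name):
    """Link tweet indices to the indices of lookup_df for one list-valued column."""
    # One row per single value, with the tweet index as a regular column
    pairs = tweets_sorted[column].explode().reset_index()
    pairs.columns = ['tweet_id', column]
    # Remove the rows of tweets that have no value in this column
    pairs = pairs[pairs[column] != '']
    # Turn the index of the lookup table into an id column (lookup_df itself is not modified)
    lookup = lookup_df[[column]].rename_axis(id_name).reset_index()
    # Right join: every remaining pair receives the id of its value
    merged = pd.merge(lookup, pairs, on=column, how='right')
    return merged[['tweet_id', id_name]].reset_index(drop=True)

# Hypothetical toy data for trying the function out
toy_sorted = pd.DataFrame({'hashtags': [['a', 'b'], '', ['b']]})
toy_lookup = pd.DataFrame({'hashtags': ['b', 'a'], 'selections': [2, 1]})

toy_tweets_hashtags = make_link_table(toy_sorted, toy_lookup, 'hashtags', 'hashtag_id')
print(toy_tweets_hashtags)
```

With such a function, a call like `make_link_table(tweets_sorted, urls_df, 'urls', 'url_id')` would stand in for the last cell above. Because the function works on a copy of the lookup table, there is no helper column to delete afterwards.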


### 2.2.2. Working with a single dataframe

In [15]:
# read/save
# describe()
# changing index and column names
# grouping
# using and resetting the index
# categorize series: categories and codes
# matrix to edgelist and vice versa
# zip
# columns into dict
# datetime
# ...

### 2.2.3. Working with multiple dataframes

In [24]:
# merge split concat etc
# ...

## 2.3. NumPy

- https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html
- https://www.pythonlikeyoumeanit.com/module_3.html

In [25]:
# read/save
# relationship to pandas
# ...

## 2.4. SciPy

In [1]:
# sparse matrices
# matrix multiplication

## 2.5. Data visualization with Seaborn & Matplotlib

SUGGESTION: TEACH HOW TO WITH SEABORN, USE MATPLOTLIB WHERE SEABORN DOES NOT OFFER METHODS

- https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html

# X. References

[<a href='#destination1_'>1</a>] https://docs.python.org/3/howto/unicode.html <a id='destination1'></a>

[<a href='#destination1_'>2</a>] https://www.makeuseof.com/how-to-include-emojis-in-your-python-code/ <a id='destination1'></a>

[<a href='#destination2_'>3</a>] https://numpy.org/doc/stable/user/absolute_beginners.html <a id='destination2'></a>

[<a href='#destination3_'>4</a>] http://pandas.pydata.org/docs/index.html <a id='destination3'></a>