# <center> Pandas*</center>

*The name pandas derives from "panel data"; it is also a play on "Python data analysis".

<img src="https://welovepandas.club/wp-content/uploads/2019/02/panda-bamboo1550035127.jpg" height=350 width=400>

In [1]:
import pandas as pd

In [2]:
pd.DataFrame()

In pandas you need to work with DataFrames and Series. According to [the documentation of pandas](https://pandas.pydata.org/pandas-docs/stable/):

* **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

* **Series**: One-dimensional ndarray with axis labels (including time series).
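To make the two definitions concrete, here is a minimal sketch on made-up data (the labels `s1`, `s2` and the ages are invented for illustration). It shows that a DataFrame can be assembled from Series, that values are aligned on their labels, and that each column is itself a Series.

```python
import pandas as pd

# Toy data, made up for illustration.
names = pd.Series(['Alice', 'Michael'], index=['s1', 's2'], name='Name')
ages = pd.Series([21, 23], index=['s2', 's1'], name='Age')  # labels in a different order

# Building a DataFrame from Series aligns the values on the index labels,
# not on their position in the input.
students = pd.DataFrame({'Name': names, 'Age': ages})

# Each column of a DataFrame is a Series.
age_column = students['Age']

print(students)
print(type(age_column).__name__)
```

Because alignment is by label, `s1` ends up with age 23 even though 23 is the second value in `ages`.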

In [2]:
pd.Series([5, 6, 7, 8, 9, 10])

0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64

In [5]:
pd.DataFrame([1])

Unnamed: 0,0
0,1


In [6]:
pd.DataFrame({'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']})

Unnamed: 0,Student,Name,Surname
0,1,Alice,Brown
1,2,Michael,Williams


In [7]:
some_date = {'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']}

In [8]:
pd.DataFrame(some_date)

Unnamed: 0,Student,Name,Surname
0,1,Alice,Brown
1,2,Michael,Williams


In [10]:
pd.DataFrame([{'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']}])

Unnamed: 0,Student,Name,Surname
0,"[1, 2]","[Alice, Michael]","[Brown, Williams]"


In [26]:
pd.DataFrame([{'Student': '1', 'Name': 'Alice', 'Surname': 'Brown'}, 
            {'Student': '2', 'Name': 'Anna', 'Age': 21}])

Unnamed: 0,Student,Name,Surname,Age
0,1,Alice,Brown,
1,2,Anna,,21.0


DataFrames can also be created with the alternative constructors (they are called on the class, not on an instance):
* `pd.DataFrame.from_records()`
* `pd.DataFrame.from_dict()`
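As a rough sketch of how these two constructors differ, the example below builds the same small table twice: once from row tuples and once from a dict keyed by row label. The variable names and the choice of `orient='index'` are assumptions made for illustration.

```python
import pandas as pd

# from_records: each tuple is one row; column names are given separately.
records = [('1', 'Alice', 'Brown'), ('2', 'Michael', 'Williams')]
by_rows = pd.DataFrame.from_records(records, columns=['Student', 'Name', 'Surname'])

# from_dict with orient='index': the outer keys become row labels
# instead of column names.
by_student = {
    '1': {'Name': 'Alice', 'Surname': 'Brown'},
    '2': {'Name': 'Michael', 'Surname': 'Williams'},
}
by_index = pd.DataFrame.from_dict(by_student, orient='index')

print(by_rows)
print(by_index)
```

With the default `orient='columns'`, `from_dict` behaves like passing the dict straight to `pd.DataFrame`.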

In [11]:
pd.DataFrame.from_dict(some_date)

Unnamed: 0,Student,Name,Surname
0,1,Alice,Brown
1,2,Michael,Williams


This data set is too big for GitHub, so download it from [here](https://www.kaggle.com/START-UMD/gtd). You will need to register on Kaggle first.

In [13]:
pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1').head()


Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


In [15]:
pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1').head(1)


Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,


In [17]:
df = pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')


Let's explore this data set. How many rows and columns are there?

In [21]:
df.shape

(181691, 135)

General information on this data set:

In [24]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 135 columns):
 #   Column              Dtype  
---  ------              -----  
 0   eventid             int64  
 1   iyear               int64  
 2   imonth              int64  
 3   iday                int64  
 4   approxdate          object 
 5   extended            int64  
 6   resolution          object 
 7   country             int64  
 8   country_txt         object 
 9   region              int64  
 10  region_txt          object 
 11  provstate           object 
 12  city                object 
 13  latitude            float64
 14  longitude           float64
 15  specificity         float64
 16  vicinity            int64  
 17  location            object 
 18  summary             object 
 19  crit1               int64  
 20  crit2               int64  
 21  crit3               int64  
 22  doubtterr           float64
 23  alternative         float64
 24  alternative_txt     objec

Let's take a look at the data set information. `.info()` accepts additional parameters, including:

* **verbose**: whether to print the full summary with one line per column (for very wide tables the per-column listing is otherwise shortened);
* **memory_usage**: whether to print memory consumption (shown by default; pass False to hide it, or 'deep' to calculate it more accurately at the cost of extra time);
* **show_counts** (called **null_counts** in pandas versions before 1.2): whether to show the number of non-missing values in each column (by default they are shown only when the DataFrame is small enough).
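A minimal sketch of these parameters on a made-up three-row table. It assumes a pandas version that has the `show_counts` parameter (it was named `null_counts` in older releases). The `buf` argument is used only to capture the report as a string.

```python
import io
import pandas as pd

# Toy data, made up for illustration; one value is missing.
toy = pd.DataFrame({'city': ['Athens', None, 'Tokyo'],
                    'year': [1970, 1970, 1971]})

# info() prints to stdout by default; buf redirects the text into a buffer.
buffer = io.StringIO()
toy.info(verbose=True, memory_usage='deep', show_counts=True, buf=buffer)
report = buffer.getvalue()

print(report)
```

With `show_counts=True` the report includes a "Non-Null Count" column, which is a quick way to spot columns with missing values.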

In [27]:
df.shape

(181691, 135)

In [28]:
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


In [32]:
(2017 - 1970) / 2

23.5

In [33]:
1970 + 23

1993


In [25]:
df.describe()

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,region,latitude,longitude,specificity,...,ransomamt,ransomamtus,ransompaid,ransompaidus,hostkidoutcome,nreleased,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
count,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,177135.0,177134.0,181685.0,...,1350.0,563.0,774.0,552.0,10991.0,10400.0,181691.0,181691.0,181691.0,181691.0
mean,200270500000.0,2002.638997,6.467277,15.505644,0.045346,131.968501,7.160938,23.498343,-458.6957,1.451452,...,3172530.0,578486.5,717943.7,240.378623,4.629242,-29.018269,-4.543731,-4.464398,0.09001,-3.945952
std,1325957000.0,13.25943,3.388303,8.814045,0.208063,112.414535,2.933408,18.569242,204779.0,0.99543,...,30211570.0,7077924.0,10143920.0,2940.967293,2.03536,65.720119,4.543547,4.637152,0.568457,4.691325
min,197000000000.0,1970.0,0.0,0.0,0.0,4.0,1.0,-53.154613,-86185900.0,1.0,...,-99.0,-99.0,-99.0,-99.0,1.0,-99.0,-9.0,-9.0,-9.0,-9.0
25%,199102100000.0,1991.0,4.0,8.0,0.0,78.0,5.0,11.510046,4.54564,1.0,...,0.0,0.0,-99.0,0.0,2.0,-99.0,-9.0,-9.0,0.0,-9.0
50%,200902200000.0,2009.0,6.0,15.0,0.0,98.0,6.0,31.467463,43.24651,1.0,...,15000.0,0.0,0.0,0.0,4.0,0.0,-9.0,-9.0,0.0,0.0
75%,201408100000.0,2014.0,9.0,23.0,0.0,160.0,10.0,34.685087,68.71033,1.0,...,400000.0,0.0,1273.412,0.0,7.0,1.0,0.0,0.0,0.0,0.0
max,201712300000.0,2017.0,12.0,31.0,1.0,1004.0,12.0,74.633553,179.3667,5.0,...,1000000000.0,132000000.0,275000000.0,48000.0,7.0,2769.0,1.0,1.0,1.0,1.0


In [37]:
df.head(30)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,,PGIS,0,0,0,0,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,


In [38]:
df.describe(include=['object', 'int'])

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
count,181691.0,181691.0,181691.0,181691.0,9239,181691.0,2220,181691.0,181691,181691.0,...,28289,115500,76933,43516,181691,181691.0,181691.0,181691.0,181691.0,25038
unique,,,,,2244,,1859,,205,,...,15429,83988,62263,36090,26,,,,,14306
top,,,,,"September 18-24, 2016",,8/4/1998,,Iraq,,...,Casualty numbers for this incident conflict ac...,Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...","Christopher Hewitt, ""Political Violence and Te...",START Primary Collection,,,,,"201612010023, 201612010024, 201612010025, 2016..."
freq,,,,,101,,18,,24636,,...,1607,205,134,139,78002,,,,,80
mean,200270500000.0,2002.638997,6.467277,15.505644,,0.045346,,131.968501,,7.160938,...,,,,,,-4.543731,-4.464398,0.09001,-3.945952,
std,1325957000.0,13.25943,3.388303,8.814045,,0.208063,,112.414535,,2.933408,...,,,,,,4.543547,4.637152,0.568457,4.691325,
min,197000000000.0,1970.0,0.0,0.0,,0.0,,4.0,,1.0,...,,,,,,-9.0,-9.0,-9.0,-9.0,
25%,199102100000.0,1991.0,4.0,8.0,,0.0,,78.0,,5.0,...,,,,,,-9.0,-9.0,0.0,-9.0,
50%,200902200000.0,2009.0,6.0,15.0,,0.0,,98.0,,6.0,...,,,,,,-9.0,-9.0,0.0,0.0,
75%,201408100000.0,2014.0,9.0,23.0,,0.0,,160.0,,10.0,...,,,,,,0.0,0.0,0.0,0.0,


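By default, the `describe` method shows the basic statistical characteristics of each numeric column (int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quantiles. When non-numeric columns are requested through the `include` parameter, it reports their count, number of unique values, most frequent value (`top`) and its frequency (`freq`).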
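The difference between the numeric and the non-numeric summary is easier to see on a tiny table. The sketch below uses made-up values; selecting the text column with `toy[['country']]` is just one way to get the non-numeric summary.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'year': [1970, 1970, 1971, 1972],
    'country': ['Greece', 'Japan', 'Greece', 'Mexico'],
})

# Default: only numeric columns are summarised.
numeric_summary = toy.describe()

# On a frame with only text columns, describe() reports
# count, unique, top and freq instead.
text_summary = toy[['country']].describe()

print(numeric_summary)
print(text_summary)
```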

How to look only at the column names, index:

In [45]:
df.columns

Index(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region',
       ...
       'addnotes', 'scite1', 'scite2', 'scite3', 'dbsource', 'INT_LOG',
       'INT_IDEO', 'INT_MISC', 'INT_ANY', 'related'],
      dtype='object', length=135)

How to look at the first 10 rows?

In [47]:
df.head(10)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,,PGIS,0,0,0,0,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,


How to look at the last 15 rows?

In [48]:
df.tail(15)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
181676,201712310009,2017,12,31,,0,,4,Afghanistan,6,...,The victims included police commander Faqeer A...,"""Commander among 5 ALP members killed in Logar...","""Media Highlights on Afghanistan 1 January 201...",,START Primary Collection,0,0,0,0,
181677,201712310010,2017,12,31,,0,,160,Philippines,5,...,,"""3 slain in Maguindanao roadside bombings,"" Ph...","""BIFF gunmen torch abandoned houses in Maguind...","""Philippines: Highlights of Terrorist, Counter...",START Primary Collection,0,0,0,0,
181678,201712310011,2017,12,30,,0,,160,Philippines,5,...,"The victims included the owner, Norodin Pacaln...","""Cops hunt North Cotabato bombers,"" Philippine...","""Philippines: Highlights of Terrorist, Counter...",,START Primary Collection,-9,-9,0,-9,
181679,201712310012,2017,12,31,,0,,95,Iraq,10,...,,"""13 IS militants killed in attack on paramilit...",,,START Primary Collection,0,1,0,1,
181680,201712310013,2017,12,31,,0,,182,Somalia,11,...,,"""Somalia's al-Shabab fires mortars at Ethiopia...","""Somalia: Al-Shabaab Militants Shell Ethiopian...",,START Primary Collection,0,1,1,1,
181681,201712310016,2017,12,31,,0,,160,Philippines,5,...,The victims included Senior Police Officer 4 M...,"""3 dead, scores injured in Mindanao blasts,"" M...","""Cop, 2 others killed in bomb blasts in Mindan...","""Cop killed, 7 injured in Maguindanao IED blas...",START Primary Collection,0,0,0,0,
181682,201712310017,2017,12,31,,0,,98,Italy,8,...,,"""Arson attack probed as racial crime,"" Ansa.it...","""Ascoli, a building destined for migrants goes...",,START Primary Collection,-9,-9,0,-9,
181683,201712310018,2017,12,31,,0,,4,Afghanistan,6,...,,"""Six Members Of One Family Shot Dead In Faryab...","""Highlights: Pakistan Pashto Press 02 January ...",,START Primary Collection,0,0,0,0,
181684,201712310019,2017,12,31,,0,,92,India,6,...,,"""Abducted PSO rescued within 11 hours,"" The Se...",,,START Primary Collection,0,0,0,0,
181685,201712310020,2017,12,31,,0,,4,Afghanistan,6,...,,"""4 people injured in Farayb explosion,"" Pajhwo...",,,START Primary Collection,-9,-9,0,-9,


How to request only the first few rows (by counting rows)?

In [52]:
df.head(3)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,


How to request particular rows by their index?

In [57]:
# return all the rows up to and including the row with the index label 3 (.loc slices include the end label)
df.loc[:3]

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,


In [61]:
# return the row at position 3 (the fourth row) as a one-row DataFrame
pd.DataFrame(df.iloc[3]).T

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
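The difference between `.loc` and `.iloc` is easy to get wrong, so here is a small sketch on made-up data with the default integer index. It also shows `toy.iloc[[2]]`, an alternative way (not used in the cells above) to get a single row back as a DataFrame.

```python
import pandas as pd

# Toy data with the default index 0, 1, 2, 3.
toy = pd.DataFrame({'country': ['Greece', 'Japan', 'Mexico', 'Italy']})

by_label = toy.loc[:2]         # label-based: the end label 2 is included -> 3 rows
by_position = toy.iloc[:2]     # position-based: the end position is excluded -> 2 rows

one_row_series = toy.iloc[2]   # a single position returns a Series
one_row_frame = toy.iloc[[2]]  # a list of positions returns a DataFrame

print(len(by_label), len(by_position))
print(type(one_row_series).__name__, type(one_row_frame).__name__)
```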


Look only at the unique values of a column.

In [62]:
df.head(15)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,,PGIS,0,0,0,0,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,


In [63]:
df['country_txt']

0         Dominican Republic
1                     Mexico
2                Philippines
3                     Greece
4                      Japan
                 ...        
181686               Somalia
181687                 Syria
181688           Philippines
181689                 India
181690           Philippines
Name: country_txt, Length: 181691, dtype: object

In [64]:
df.country_txt

0         Dominican Republic
1                     Mexico
2                Philippines
3                     Greece
4                      Japan
                 ...        
181686               Somalia
181687                 Syria
181688           Philippines
181689                 India
181690           Philippines
Name: country_txt, Length: 181691, dtype: object

In [65]:
df['country_txt'].unique()

array(['Dominican Republic', 'Mexico', 'Philippines', 'Greece', 'Japan',
       'United States', 'Uruguay', 'Italy', 'East Germany (GDR)',
       'Ethiopia', 'Guatemala', 'Venezuela', 'West Germany (FRG)',
       'Switzerland', 'Jordan', 'Spain', 'Brazil', 'Egypt', 'Argentina',
       'Lebanon', 'Ireland', 'Turkey', 'Paraguay', 'Iran',
       'United Kingdom', 'Colombia', 'Bolivia', 'Nicaragua',
       'Netherlands', 'Belgium', 'Canada', 'Australia', 'Pakistan',
       'Zambia', 'Sweden', 'Costa Rica', 'South Yemen', 'Cambodia',
       'Israel', 'Poland', 'Taiwan', 'Panama', 'Kuwait',
       'West Bank and Gaza Strip', 'Austria', 'Czechoslovakia', 'India',
       'France', 'South Vietnam', 'Brunei', 'Zaire',
       "People's Republic of the Congo", 'Portugal', 'Algeria',
       'El Salvador', 'Thailand', 'Haiti', 'Sudan', 'Morocco', 'Cyprus',
       'Myanmar', 'Afghanistan', 'Peru', 'Chile', 'Honduras',
       'Yugoslavia', 'Ecuador', 'New Zealand', 'Malaysia', 'Singapore',
       'Bot

In [67]:
len(df['country_txt'].unique())

205

In [66]:
df['country_txt'].nunique()

205
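The two approaches agree here, but they can differ when a column has missing values: `unique()` keeps the missing value as an entry, while `nunique()` skips it by default. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data, made up for illustration; one value is missing.
cities = pd.Series(['Athens', 'Tokyo', None, 'Athens'])

n_with_missing = len(cities.unique())           # the missing value counts as an entry
n_without_missing = cities.nunique()            # missing values are skipped by default
n_dropna_false = cities.nunique(dropna=False)   # opt in to counting the missing value

print(n_with_missing, n_without_missing, n_dropna_false)
```

This matters for columns such as ```city```, where some rows may have no value.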

How many unique values are there in the ```city``` column? In other words, for how many cities does this data set hold information on terrorist attacks?

In [69]:
df['city'].nunique()

36674

In which years did the largest number of terrorist attacks occur (according to this data set only)?

In [72]:
df['iyear'].value_counts().head(5)

2014    16903
2015    14965
2016    13587
2013    12036
2017    10900
Name: iyear, dtype: int64

In [73]:
df['iyear'].value_counts()[:5]

2014    16903
2015    14965
2016    13587
2013    12036
2017    10900
Name: iyear, dtype: int64

In [74]:
df['iyear'].value_counts().tail(5)

1970    651
1974    581
1972    568
1973    473
1971    471
Name: iyear, dtype: int64

In [76]:
df['iyear'].value_counts()[-5:]

1970    651
1974    581
1972    568
1973    473
1971    471
Name: iyear, dtype: int64
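`value_counts()` orders its result by count, which is why `head` and `tail` give the most and least frequent years. The sketch below, on made-up years, shows two related options: `sort_index()` to read the counts chronologically, and `normalize=True` to get shares instead of counts.

```python
import pandas as pd

# Toy data, made up for illustration.
years = pd.Series([2014, 2014, 2014, 2015, 2015, 1971])

counts = years.value_counts()                  # sorted by count, largest first
chronological = counts.sort_index()            # same counts, ordered by year
shares = years.value_counts(normalize=True)    # fraction of rows per year

print(counts)
print(chronological)
print(shares)
```

Using `.head(5)` or `.iloc[:5]` rather than `[:5]` makes it explicit that the selection is by position, which helps when the index itself consists of integers such as years.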

How can we sort all rows by year in ascending or descending order?

In [80]:
df['iyear'].sort_values(ascending=True)

0         1970
430       1970
431       1970
432       1970
433       1970
          ... 
174425    2017
174426    2017
174427    2017
174419    2017
181690    2017
Name: iyear, Length: 181691, dtype: int64

In [79]:
df['iyear'].sort_values(ascending=False)

181690    2017
174419    2017
174427    2017
174426    2017
174425    2017
          ... 
433       1970
432       1970
431       1970
430       1970
0         1970
Name: iyear, Length: 181691, dtype: int64

In [83]:
df.sort_values(by=['iyear', 'country_txt'], ascending=False)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
176881,201707120016,2017,7,12,,0,,231,Zimbabwe,11,...,,"""Stop violence, Mugabe told following 'savage ...","""Zimbabwean opposition leaders attacked ahead ...","""MDC Tsvangirai Youth Assembly Vehicle Petrol ...",START Primary Collection,0,0,0,0,
177097,201707180028,2017,7,19,,0,,231,Zimbabwe,11,...,,"""Mudzuri blames Zanu PF for arson attack,"" Zim...","""Stop violence, Mugabe told following 'savage ...","""Zimbabwean opposition leaders attacked ahead ...",START Primary Collection,0,0,0,0,
177106,201707180043,2017,7,19,,0,,231,Zimbabwe,11,...,,"""Stop violence, Mugabe told following 'savage ...","""Zimbabwean opposition leaders attacked ahead ...","""Upsurge in cases of Politically motivated Vio...",START Primary Collection,0,0,0,0,
176648,201707040037,2017,7,4,,0,,230,Zambia,11,...,Security forces also claimed to carry out the ...,"""Declaration of near-state of emergency in Zam...","""Zambian leader Lungu warns of 'sabotage' afte...","""Zambia's biggest market gutted, government su...",START Primary Collection,0,0,0,0,
176801,201707090040,2017,7,9,,0,,230,Zambia,11,...,,"""Top news items in major Zambian media outlets...",,,START Primary Collection,-9,-9,0,-9,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,197011150001,1970,11,15,,0,,11,Argentina,3,...,,,,,PGIS,0,0,0,0,
607,197011200001,1970,11,20,,0,,11,Argentina,3,...,,,,,PGIS,0,1,1,1,
608,197011200002,1970,11,20,,0,,11,Argentina,3,...,,,,,PGIS,0,1,1,1,
609,197011200003,1970,11,20,,0,,11,Argentina,3,...,,,,,PGIS,0,1,1,1,
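When sorting by several columns, `ascending` can also be a list with one flag per column. The sketch below, on made-up rows, sorts by year from newest to oldest while keeping countries in alphabetical order within each year.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'iyear': [1970, 2017, 2017, 1970],
    'country_txt': ['Japan', 'Zambia', 'Iraq', 'Greece'],
})

# One flag per sort key: year descending, country ascending.
newest_first = toy.sort_values(by=['iyear', 'country_txt'],
                               ascending=[False, True])

print(newest_first)
```

`sort_values` returns a new DataFrame and keeps the original row labels, so `toy` itself stays in its original order.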


Which data type does each column have?

In [86]:
dict(df.dtypes)['success']

dtype('int64')
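`df.dtypes` is a Series indexed by column name, so it can be queried directly without converting it to a dict. The sketch below uses a made-up three-column table and also shows `select_dtypes`, which picks columns by type.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'success': [1, 0],
    'latitude': [37.9, 35.6],
    'city': ['Athens', 'Tokyo'],
})

# Look up the dtype of one column by name.
success_dtype = toy.dtypes['success']

# Names of all numeric columns.
numeric_columns = toy.select_dtypes(include='number').columns.tolist()

print(toy.dtypes)
print(numeric_columns)
```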

How to check the missing values?

In [87]:
df.head(5)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


In [88]:
df.isna()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
1,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
2,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
3,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
4,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,False,False,False,False,True,False,True,False,False,False,...,True,False,False,False,False,False,False,False,False,True
181687,False,False,False,False,True,False,True,False,False,False,...,True,False,False,False,False,False,False,False,False,True
181688,False,False,False,False,True,False,True,False,False,False,...,True,False,True,True,False,False,False,False,False,True
181689,False,False,False,False,True,False,True,False,False,False,...,True,False,True,True,False,False,False,False,False,True


In [90]:
df.shape

(181691, 135)

In [93]:
172452 / 181691 * 100

94.91499303762983

In [89]:
df.isna().sum()

eventid            0
iyear              0
imonth             0
iday               0
approxdate    172452
               ...  
INT_LOG            0
INT_IDEO           0
INT_MISC           0
INT_ANY            0
related       156653
Length: 135, dtype: int64
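The percentage computed by hand above (missing values divided by the number of rows) can be obtained for every column at once: the mean of a boolean column is the fraction of `True` values. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data, made up for illustration: 3 of 4 'approxdate' values are missing.
toy = pd.DataFrame({
    'approxdate': [None, None, None, 'July 1970'],
    'iyear': [1970, 1970, 1970, 1970],
})

# isna() gives True/False per cell; mean() turns that into a fraction per column.
missing_share = toy.isna().mean() * 100

print(missing_share)
```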

Let's delete the column ```approxdate``` from this data set, because about 95% of its values are missing:

In [96]:
df.drop('approxdate', axis=1, inplace=True)

In [99]:
df.shape

(181691, 134)

In [100]:
df.isna().sum()

eventid          0
iyear            0
imonth           0
iday             0
extended         0
             ...  
INT_LOG          0
INT_IDEO         0
INT_MISC         0
INT_ANY          0
related     156653
Length: 134, dtype: int64

In [101]:
df.dropna(axis=1)

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,country_txt,region,region_txt,vicinity,...,gname,individual,weaptype1,weaptype1_txt,property,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
0,197000000001,1970,7,2,0,58,Dominican Republic,2,Central America & Caribbean,0,...,MANO-D,0,13,Unknown,0,PGIS,0,0,0,0
1,197000000002,1970,0,0,0,130,Mexico,1,North America,0,...,23rd of September Communist League,0,13,Unknown,0,PGIS,0,1,1,1
2,197001000001,1970,1,0,0,160,Philippines,5,Southeast Asia,0,...,Unknown,0,13,Unknown,0,PGIS,-9,-9,1,1
3,197001000002,1970,1,0,0,78,Greece,8,Western Europe,0,...,Unknown,0,6,Explosives,1,PGIS,-9,-9,1,1
4,197001000003,1970,1,0,0,101,Japan,4,East Asia,0,...,Unknown,0,8,Incendiary,1,PGIS,-9,-9,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,182,Somalia,11,Sub-Saharan Africa,0,...,Al-Shabaab,0,5,Firearms,-9,START Primary Collection,0,0,0,0
181687,201712310029,2017,12,31,0,200,Syria,10,Middle East & North Africa,1,...,Muslim extremists,0,6,Explosives,1,START Primary Collection,-9,-9,1,1
181688,201712310030,2017,12,31,0,160,Philippines,5,Southeast Asia,0,...,Bangsamoro Islamic Freedom Movement (BIFM),0,8,Incendiary,1,START Primary Collection,0,0,0,0
181689,201712310031,2017,12,31,0,92,India,6,South Asia,0,...,Unknown,0,6,Explosives,-9,START Primary Collection,-9,-9,0,-9
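`dropna(axis=1)` keeps only the columns without a single missing value, which is why the result above is much narrower. The sketch below, on made-up data, contrasts it with row-wise dropping and with the `thresh` parameter, which keeps columns that have at least a given number of non-missing values.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [1.0, np.nan, 3.0],
    'c': [np.nan, np.nan, np.nan],
})

complete_columns = toy.dropna(axis=1)          # columns with no missing values
complete_rows = toy.dropna(axis=0)             # rows with no missing values
mostly_filled = toy.dropna(axis=1, thresh=2)   # columns with at least 2 non-missing values

print(list(complete_columns.columns))
print(len(complete_rows))
print(list(mostly_filled.columns))
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves the original unchanged.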


Create a new variable ```casualties``` by summing the values in the ```nkill``` (killed) and ```nwound``` (wounded) columns.

In [104]:
list(df.columns)

['eventid',
 'iyear',
 'imonth',
 'iday',
 'extended',
 'resolution',
 'country',
 'country_txt',
 'region',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'location',
 'summary',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'alternative',
 'alternative_txt',
 'multiple',
 'success',
 'suicide',
 'attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt',
 'targtype1',
 'targtype1_txt',
 'targsubtype1',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1',
 'natlty1_txt',
 'targtype2',
 'targtype2_txt',
 'targsubtype2',
 'targsubtype2_txt',
 'corp2',
 'target2',
 'natlty2',
 'natlty2_txt',
 'targtype3',
 'targtype3_txt',
 'targsubtype3',
 'targsubtype3_txt',
 'corp3',
 'target3',
 'natlty3',
 'natlty3_txt',
 'gname',
 'gsubname',
 'gname2',
 'gsubname2',
 'gname3',
 'gsubname3',
 'motive',
 'guncertain1',
 'guncertain2',
 'guncertain3',
 'individual',
 'nperps',
 'nperpcap',
 'claimed',
 'cl

In [105]:
df['casualties'] = df['nkill'] + df['nwound']
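Adding two columns with `+` gives NaN whenever either value is missing, which is why some `casualties` values are empty. If a missing count should be treated as zero (an assumption about the data, not something the data set states), `Series.add` with `fill_value=0` is one option. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'nkill': [1.0, np.nan, np.nan],
    'nwound': [0.0, 2.0, np.nan],
})

# Plain addition: the result is NaN if either value is missing.
plain_sum = toy['nkill'] + toy['nwound']

# add() with fill_value=0: a single missing value is treated as 0;
# the result is still NaN when both values are missing.
filled_sum = toy['nkill'].add(toy['nwound'], fill_value=0)

print(plain_sum)
print(filled_sum)
```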

In [106]:
df

Unnamed: 0,eventid,iyear,imonth,iday,extended,resolution,country,country_txt,region,region_txt,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
0,197000000001,1970,7,2,0,,58,Dominican Republic,2,Central America & Caribbean,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,0,,130,Mexico,1,North America,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,0,,160,Philippines,5,Southeast Asia,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,0,,78,Greece,8,Western Europe,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,0,,101,Japan,4,East Asia,...,,,,PGIS,-9,-9,1,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,,182,Somalia,11,Sub-Saharan Africa,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,0,,200,Syria,10,Middle East & North Africa,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,0,,160,Philippines,5,Southeast Asia,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,0,,92,India,6,South Asia,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0


Rename the column ```iyear``` to ```year```:

In [107]:
df.rename({'iyear' : 'year'}, axis='columns', inplace=True)
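As an alternative sketch, `rename` also accepts a `columns=` mapping and, without `inplace=True`, returns a new DataFrame. The second renamed column (`imonth` to `month`) is added here only for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({'iyear': [1970], 'imonth': [7]})

# Several columns can be renamed in one call; the original is left untouched.
renamed = toy.rename(columns={'iyear': 'year', 'imonth': 'month'})

print(list(renamed.columns))
print(list(toy.columns))
```

Assigning the result to a new name keeps the original available, which makes it safer to re-run notebook cells out of order.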

In [108]:
df

Unnamed: 0,eventid,year,imonth,iday,extended,resolution,country,country_txt,region,region_txt,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
0,197000000001,1970,7,2,0,,58,Dominican Republic,2,Central America & Caribbean,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,0,,130,Mexico,1,North America,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,0,,160,Philippines,5,Southeast Asia,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,0,,78,Greece,8,Western Europe,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,0,,101,Japan,4,East Asia,...,,,,PGIS,-9,-9,1,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,,182,Somalia,11,Sub-Saharan Africa,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,0,,200,Syria,10,Middle East & North Africa,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,0,,160,Philippines,5,Southeast Asia,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,0,,92,India,6,South Asia,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0
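
As a minimal sketch of the same renaming on a small frame: the toy data below is hypothetical, and only the column names mirror the dataset. It uses the `columns=` keyword, which is equivalent to passing `axis='columns'`, and shows that without `inplace=True` the original frame is left unchanged.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({'iyear': [1970, 2017], 'imonth': [7, 12]})

# rename returns a new DataFrame by default; `toy` itself is not modified.
renamed = toy.rename(columns={'iyear': 'year', 'imonth': 'month'})

print(list(toy.columns))      # ['iyear', 'imonth']
print(list(renamed.columns))  # ['year', 'month']
```

Assigning the result to a new name keeps the original frame available, which can be convenient when re-running notebook cells out of order.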


How do we drop all rows that contain missing values? How do we replace missing values with something else?

In [None]:
df.dropna(inplace=True)
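
Note that `dropna()` with no arguments removes every row that has at least one missing value in any column, which can leave very little of a wide table. The sketch below uses a hypothetical toy frame (the values are made up) to compare it with two gentler options: restricting the check to some columns with `subset`, and filling gaps with `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns.
toy = pd.DataFrame({
    'country_txt': ['Iraq', 'India', None],
    'casualties': [7.0, np.nan, np.nan],
})

# Drop rows that have a missing value in ANY column.
drop_any = toy.dropna()

# Drop rows only when the listed column is missing.
drop_subset = toy.dropna(subset=['country_txt'])

# Replace missing values in one column; other columns are left as they are.
filled = toy.fillna({'casualties': 0})

print(len(drop_any), len(drop_subset))  # 1 2
print(filled['casualties'].tolist())    # [7.0, 0.0, 0.0]
```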

**Task!** Use a function to replace the NaNs (= missing values) in the ```related``` column with the string 'None'.

In [109]:
df

Unnamed: 0,eventid,year,imonth,iday,extended,resolution,country,country_txt,region,region_txt,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
0,197000000001,1970,7,2,0,,58,Dominican Republic,2,Central America & Caribbean,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,0,,130,Mexico,1,North America,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,0,,160,Philippines,5,Southeast Asia,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,0,,78,Greece,8,Western Europe,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,0,,101,Japan,4,East Asia,...,,,,PGIS,-9,-9,1,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,,182,Somalia,11,Sub-Saharan Africa,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,0,,200,Syria,10,Middle East & North Africa,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,0,,160,Philippines,5,Southeast Asia,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,0,,92,India,6,South Asia,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0


For a selected column, show its mean, median, and/or mode:

In [110]:
df['year'].mean()

2002.6389969783863

In [111]:
df['year'].median()

2009.0

In [112]:
df['year'].mode()

0    2014
dtype: int64
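
The mode comes back as a Series rather than a single number because several values can be tied for most frequent. A small sketch with hypothetical values illustrates this:

```python
import pandas as pd

# Hypothetical values: 2014 and 2016 both appear twice.
years = pd.Series([2014, 2014, 2016, 2016, 2017])

print(years.mean())           # 2015.4
print(years.median())         # 2016.0
print(years.mode().tolist())  # [2014, 2016]
```

If only one value is needed, `years.mode()[0]` picks the smallest of the tied values, since the result is sorted.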

Min, max and sum:

In [113]:
df['year'].min()

1970

In [114]:
df['year'].max()

2017

In [115]:
min(df['year'])

1970

In [118]:
min([4, 6, 7, 10])

4
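
The built-in `min` and the `.min()` method agree on ```year``` because that column has no missing values. As a sketch with hypothetical values, the example below shows `.sum()` as well, and how the two approaches can differ once a NaN is present: the pandas methods skip NaN by default, while the built-in compares values one by one.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a missing entry in first position.
casualties = pd.Series([np.nan, 3.0, 1.0])

# pandas reductions ignore NaN by default (skipna=True).
print(casualties.min())  # 1.0
print(casualties.max())  # 3.0
print(casualties.sum())  # 4.0

# The built-in starts from the first element; every comparison with NaN
# is False, so the NaN is never replaced.
print(min(casualties))   # nan
```

For columns that may contain missing values, the Series methods are usually the safer choice.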

Filter the dataset to keep only the attacks that happened after 2015:

In [121]:
df[df['year'] > 2015]

Unnamed: 0,eventid,year,imonth,iday,extended,resolution,country,country_txt,region,region_txt,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
157202,201601010003,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,7.0
157203,201601010004,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,9.0
157204,201601010005,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...","""Terrorism: Transcript of ISIL's Al-Bayan Radi...",,START Primary Collection,0,1,0,1,,5.0
157205,201601010008,2016,1,1,0,,92,India,6,South Asia,...,"""Terrorists abduct SP, friends in Punjab,"" Hin...","""Punjab on alert after cop thrashed by men in ...","""Alert in Punjab after SP's abduction,"" India ...",START Primary Collection,-9,-9,0,-9,"201601010008, 201601010009",3.0
157206,201601010009,2016,1,1,0,,92,India,6,South Asia,...,"""Terrorists abduct SP, friends in Punjab,"" Hin...","""Punjab on alert after cop thrashed by men in ...","""Alert in Punjab after SP's abduction,"" India ...",START Primary Collection,-9,-9,0,-9,"201601010008, 201601010009",1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,,182,Somalia,11,Sub-Saharan Africa,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,0,,200,Syria,10,Middle East & North Africa,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,0,,160,Philippines,5,Southeast Asia,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,0,,92,India,6,South Asia,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0
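
The filter works in two steps: the comparison produces a boolean Series with one True/False per row, and indexing the frame with that Series keeps only the True rows. A minimal sketch on a hypothetical toy frame makes the intermediate mask visible:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    'year': [2014, 2015, 2016, 2017],
    'country_txt': ['Iraq', 'India', 'Iraq', 'Syria'],
})

# Step 1: the comparison yields a boolean mask.
mask = toy['year'] > 2015
print(mask.tolist())                # [False, False, True, True]

# Step 2: indexing with the mask keeps the rows where it is True.
after_2015 = toy[mask]
print(after_2015['year'].tolist())  # [2016, 2017]
```

The original row labels are preserved in the result, which is why the filtered output above starts at index 157202 rather than 0.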


What if we have several conditions? Combine them with ```&``` (and) or ```|``` (or), and wrap each condition in parentheses. Try it out:

In [122]:
df[(df['year'] > 2015) & (df['country_txt'] == 'Iraq')]

Unnamed: 0,eventid,year,imonth,iday,extended,resolution,country,country_txt,region,region_txt,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
157202,201601010003,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,7.0
157203,201601010004,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,9.0
157204,201601010005,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Roundup of Security Incidents 29 Decemb...","""Terrorism: Transcript of ISIL's Al-Bayan Radi...",,START Primary Collection,0,1,0,1,,5.0
157218,201601010024,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""ISIS suicide bombers attack Iraqi forces at b...","""Iraqi forces attempt to clear 'IS' pockets ou...","""ISIS attacks headquarters of Iraqi army's 10t...",START Primary Collection,0,1,0,1,,28.0
157219,201601010025,2016,1,1,0,,95,Iraq,10,Middle East & North Africa,...,"""Blast in Iraq's Al-Ramadi kills sixteen,"" Al-...",,,START Primary Collection,-9,-9,0,-9,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181669,201712310002,2017,12,30,0,,95,Iraq,10,Middle East & North Africa,...,"""Five civilians killed in two bomb attacks in ...","""Iraq: Security Roundup 1900 GMT 31 December 2...","""Iraq: Roundup of Violent Activities Targeting...",START Primary Collection,-9,-9,0,-9,"201712300005, 201712310002",5.0
181670,201712310003,2017,12,31,0,,95,Iraq,10,Middle East & North Africa,...,"""Iraq: Security Roundup 1900 GMT 31 December 2...","""Terrorism: Roundup of Official ISIS Messages ...",,START Primary Collection,0,1,0,1,,1.0
181671,201712310004,2017,12,31,0,,95,Iraq,10,Middle East & North Africa,...,"""Islamic State attack leaves thirteen civilian...","""3 people killed in IS attack in central Iraq,...","""Daesh gunmen kill 3 in northern Iraq: Police,...",START Primary Collection,0,1,0,1,,13.0
181674,201712310007,2017,12,31,0,,95,Iraq,10,Middle East & North Africa,...,"""Five IS militants killed as paramilitary troo...","""1,262 Killed in Iraq During December,"" Antiwa...",,START Primary Collection,0,1,0,1,,5.0
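
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` and `==`. The sketch below, on a hypothetical toy frame, shows the common ways to combine conditions: and, or, negation with `~`, and `isin` as a shorthand for several equality checks.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    'year': [2014, 2015, 2016, 2017],
    'country_txt': ['Iraq', 'India', 'Iraq', 'Syria'],
})

# Both conditions must hold.
both = toy[(toy['year'] > 2015) & (toy['country_txt'] == 'Iraq')]

# At least one condition must hold.
either = toy[(toy['year'] > 2015) | (toy['country_txt'] == 'Iraq')]

# Negate a condition with ~.
not_iraq = toy[~(toy['country_txt'] == 'Iraq')]

# Match any value from a list.
listed = toy[toy['country_txt'].isin(['Iraq', 'Syria'])]

print(both.index.tolist())      # [2]
print(either.index.tolist())    # [0, 2, 3]
print(not_iraq.index.tolist())  # [1, 3]
print(listed.index.tolist())    # [0, 2, 3]
```

Python's `and` and `or` keywords do not work here, because they try to reduce a whole Series to a single True/False value.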


Additional materials:

* https://www.kaggle.com/START-UMD/gtd/code?datasetId=504&sortBy=voteCount