# <center> Pandas*</center>

*pandas is short for Python Data Analysis Library

<img src="https://welovepandas.club/wp-content/uploads/2019/02/panda-bamboo1550035127.jpg" height=350 width=400>

In [2]:
import pandas as pd

In pandas you need to work with DataFrames and Series. According to [the documentation of pandas](https://pandas.pydata.org/pandas-docs/stable/):

* **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

* **Series**: One-dimensional ndarray with axis labels (including time series).
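To make the two definitions concrete, here is a small sketch on made-up values (the names `scores`, `bonus` and `frame` are illustrative and not part of the data set used later). It shows that arithmetic between Series aligns on labels rather than on position, and that selecting a column of a DataFrame gives back a Series.

```python
import pandas as pd

# Two Series with partly different labels (toy data)
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])
bonus = pd.Series([1, 2], index=["b", "c"])

# Arithmetic aligns on labels; "a" has no partner, so the result is NaN there
total = scores + bonus

# A DataFrame behaves like a dict of Series that share one index
frame = pd.DataFrame({"scores": scores, "bonus": bonus})
column = frame["scores"]  # selecting a column returns a Series

print(total)
print(type(column).__name__)
```

Because of this alignment, the order of the rows does not matter for arithmetic; only the labels do.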

In [5]:
pd.Series([5, 6, 7, 8, 9, 10])

0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64

In [9]:
pd.DataFrame([1])

Unnamed: 0,0
0,1


In [11]:
some_data = {'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']}

pd.DataFrame(some_data)

Unnamed: 0,Student,Name,Surname
0,1,Alice,Brown
1,2,Michael,Williams


In [12]:
some_data = [{'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']}]

pd.DataFrame(some_data)

Unnamed: 0,Student,Name,Surname
0,"[1, 2]","[Alice, Michael]","[Brown, Williams]"


In [13]:
pd.DataFrame([{'Student': '1', 'Name': 'Alice', 'Surname': 'Brown'}, 
            {'Student': '2', 'Name': 'Anna', 'Surname': 'White'}])

Unnamed: 0,Student,Name,Surname
0,1,Alice,Brown
1,2,Anna,White


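Check the alternative constructors as well. Both are class methods, so they are called on pd.DataFrame itself rather than on an instance:
* pd.DataFrame.from_records()
* pd.DataFrame.from_dict()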
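As a rough illustration of the difference between the two constructors, the sketch below builds the same kind of table from a list of row dictionaries and from a dictionary of columns. The variable names are made up for this example; `orient="index"` is a documented option of `from_dict`.

```python
import pandas as pd

# A list of row dictionaries: one dict per record
records = [
    {"Student": "1", "Name": "Alice", "Surname": "Brown"},
    {"Student": "2", "Name": "Anna", "Surname": "White"},
]
by_rows = pd.DataFrame.from_records(records)

# A dictionary of columns: each key maps to the full list of column values
columns = {"Student": ["1", "2"], "Name": ["Alice", "Anna"]}
by_columns = pd.DataFrame.from_dict(columns)

# orient="index" treats the dictionary keys as row labels instead of column names
by_index = pd.DataFrame.from_dict(
    {"row1": {"Name": "Alice"}, "row2": {"Name": "Anna"}}, orient="index"
)

print(by_rows)
print(by_columns)
print(by_index)
```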

In [15]:
pd.DataFrame.from_records(some_data)

Unnamed: 0,Student,Name,Surname
0,"[1, 2]","[Alice, Michael]","[Brown, Williams]"


In [None]:
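# from_dict expects a dictionary; some_data is a one-element list here, so pass its first item
pd.DataFrame.from_dict(some_data[0])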

This data set is too big for GitHub; download it from [here](https://www.kaggle.com/START-UMD/gtd). You will need to register on Kaggle first.

In [16]:
df = pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')

Let's explore the second set of data. How many rows and columns are there?

In [19]:
df.shape

(181691, 135)

General information on this data set:

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Columns: 135 entries, eventid to related
dtypes: float64(55), int64(22), object(58)
memory usage: 187.1+ MB


Let's take a look at the dataset information. You can pass additional parameters to ```.info()```, including:

* **verbose**: whether to print the full per-column summary (for very wide tables pandas otherwise prints a truncated summary, as in the output above);
* **memory_usage**: whether to print memory consumption (shown by default; pass False to hide it, or 'deep' to compute it more accurately by inspecting object columns);
* **show_counts** (called **null_counts** in older pandas versions): whether to show the number of non-null values per column.
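The parameters listed above can be tried on a tiny frame. This is a minimal sketch on made-up data; it assumes pandas 1.2 or newer, where the counts option is named `show_counts`. The `buf` argument is only used here to capture the report as a string.

```python
import io

import pandas as pd

toy = pd.DataFrame({"city": ["Athens", None, "Rome"], "year": [1970, 1971, 1972]})

# info() prints to stdout by default; buf= redirects the report into a string buffer
buffer = io.StringIO()
toy.info(verbose=True, memory_usage="deep", show_counts=True, buf=buffer)
report = buffer.getvalue()

print(report)
```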

In [21]:
df.describe()

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,region,latitude,longitude,specificity,...,ransomamt,ransomamtus,ransompaid,ransompaidus,hostkidoutcome,nreleased,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
count,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,177135.0,177134.0,181685.0,...,1350.0,563.0,774.0,552.0,10991.0,10400.0,181691.0,181691.0,181691.0,181691.0
mean,200270500000.0,2002.638997,6.467277,15.505644,0.045346,131.968501,7.160938,23.498343,-458.6957,1.451452,...,3172530.0,578486.5,717943.7,240.378623,4.629242,-29.018269,-4.543731,-4.464398,0.09001,-3.945952
std,1325957000.0,13.25943,3.388303,8.814045,0.208063,112.414535,2.933408,18.569242,204779.0,0.99543,...,30211570.0,7077924.0,10143920.0,2940.967293,2.03536,65.720119,4.543547,4.637152,0.568457,4.691325
min,197000000000.0,1970.0,0.0,0.0,0.0,4.0,1.0,-53.154613,-86185900.0,1.0,...,-99.0,-99.0,-99.0,-99.0,1.0,-99.0,-9.0,-9.0,-9.0,-9.0
25%,199102100000.0,1991.0,4.0,8.0,0.0,78.0,5.0,11.510046,4.54564,1.0,...,0.0,0.0,-99.0,0.0,2.0,-99.0,-9.0,-9.0,0.0,-9.0
50%,200902200000.0,2009.0,6.0,15.0,0.0,98.0,6.0,31.467463,43.24651,1.0,...,15000.0,0.0,0.0,0.0,4.0,0.0,-9.0,-9.0,0.0,0.0
75%,201408100000.0,2014.0,9.0,23.0,0.0,160.0,10.0,34.685087,68.71033,1.0,...,400000.0,0.0,1273.412,0.0,7.0,1.0,0.0,0.0,0.0,0.0
max,201712300000.0,2017.0,12.0,31.0,1.0,1004.0,12.0,74.633553,179.3667,5.0,...,1000000000.0,132000000.0,275000000.0,48000.0,7.0,2769.0,1.0,1.0,1.0,1.0


In [25]:
df.describe(include=['object', 'int'])

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
count,181691.0,181691.0,181691.0,181691.0,9239,181691.0,2220,181691.0,181691,181691.0,...,28289,115500,76933,43516,181691,181691.0,181691.0,181691.0,181691.0,25038
unique,,,,,2244,,1859,,205,,...,15429,83988,62263,36090,26,,,,,14306
top,,,,,"September 18-24, 2016",,8/4/1998,,Iraq,,...,Casualty numbers for this incident conflict ac...,Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...","Christopher Hewitt, ""Political Violence and Te...",START Primary Collection,,,,,"201612010023, 201612010024, 201612010025, 2016..."
freq,,,,,101,,18,,24636,,...,1607,205,134,139,78002,,,,,80
mean,200270500000.0,2002.638997,6.467277,15.505644,,0.045346,,131.968501,,7.160938,...,,,,,,-4.543731,-4.464398,0.09001,-3.945952,
std,1325957000.0,13.25943,3.388303,8.814045,,0.208063,,112.414535,,2.933408,...,,,,,,4.543547,4.637152,0.568457,4.691325,
min,197000000000.0,1970.0,0.0,0.0,,0.0,,4.0,,1.0,...,,,,,,-9.0,-9.0,-9.0,-9.0,
25%,199102100000.0,1991.0,4.0,8.0,,0.0,,78.0,,5.0,...,,,,,,-9.0,-9.0,0.0,-9.0,
50%,200902200000.0,2009.0,6.0,15.0,,0.0,,98.0,,6.0,...,,,,,,-9.0,-9.0,0.0,0.0,
75%,201408100000.0,2014.0,9.0,23.0,,0.0,,160.0,,10.0,...,,,,,,0.0,0.0,0.0,0.0,


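The describe method shows the basic statistical characteristics of each numeric feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quartiles. With the ```include``` parameter, as in the second call above, non-numeric columns are summarised too (count, number of unique values, most frequent value and its frequency).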
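The sketch below repeats the idea on a small made-up frame, so the numbers are easy to check by hand. The column names are illustrative; `include="all"` and `percentiles` are documented options of `describe`.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1970, 1980, 1990, 2000],
    "country": ["Iraq", "Iraq", "Italy", "India"],
})

numeric_summary = toy.describe()                # numeric columns only
all_summary = toy.describe(include="all")       # adds unique / top / freq for text columns
custom = toy.describe(percentiles=[0.1, 0.9])   # other percentiles; the median is always kept

print(numeric_summary)
print(all_summary)
print(custom)
```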

How to look only at the column names or the index:

In [31]:
df.columns

Index(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region',
       ...
       'addnotes', 'scite1', 'scite2', 'scite3', 'dbsource', 'INT_LOG',
       'INT_IDEO', 'INT_MISC', 'INT_ANY', 'related'],
      dtype='object', length=135)

In [34]:
df.index

RangeIndex(start=0, stop=181691, step=1)

How to look at the first 10 rows?

In [36]:
df.head(10)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,,PGIS,0,0,0,0,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,


How to look at the last 15 rows?

In [38]:
df.tail(15)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
181676,201712310009,2017,12,31,,0,,4,Afghanistan,6,...,The victims included police commander Faqeer A...,"""Commander among 5 ALP members killed in Logar...","""Media Highlights on Afghanistan 1 January 201...",,START Primary Collection,0,0,0,0,
181677,201712310010,2017,12,31,,0,,160,Philippines,5,...,,"""3 slain in Maguindanao roadside bombings,"" Ph...","""BIFF gunmen torch abandoned houses in Maguind...","""Philippines: Highlights of Terrorist, Counter...",START Primary Collection,0,0,0,0,
181678,201712310011,2017,12,30,,0,,160,Philippines,5,...,"The victims included the owner, Norodin Pacaln...","""Cops hunt North Cotabato bombers,"" Philippine...","""Philippines: Highlights of Terrorist, Counter...",,START Primary Collection,-9,-9,0,-9,
181679,201712310012,2017,12,31,,0,,95,Iraq,10,...,,"""13 IS militants killed in attack on paramilit...",,,START Primary Collection,0,1,0,1,
181680,201712310013,2017,12,31,,0,,182,Somalia,11,...,,"""Somalia's al-Shabab fires mortars at Ethiopia...","""Somalia: Al-Shabaab Militants Shell Ethiopian...",,START Primary Collection,0,1,1,1,
181681,201712310016,2017,12,31,,0,,160,Philippines,5,...,The victims included Senior Police Officer 4 M...,"""3 dead, scores injured in Mindanao blasts,"" M...","""Cop, 2 others killed in bomb blasts in Mindan...","""Cop killed, 7 injured in Maguindanao IED blas...",START Primary Collection,0,0,0,0,
181682,201712310017,2017,12,31,,0,,98,Italy,8,...,,"""Arson attack probed as racial crime,"" Ansa.it...","""Ascoli, a building destined for migrants goes...",,START Primary Collection,-9,-9,0,-9,
181683,201712310018,2017,12,31,,0,,4,Afghanistan,6,...,,"""Six Members Of One Family Shot Dead In Faryab...","""Highlights: Pakistan Pashto Press 02 January ...",,START Primary Collection,0,0,0,0,
181684,201712310019,2017,12,31,,0,,92,India,6,...,,"""Abducted PSO rescued within 11 hours,"" The Se...",,,START Primary Collection,0,0,0,0,
181685,201712310020,2017,12,31,,0,,4,Afghanistan,6,...,,"""4 people injured in Farayb explosion,"" Pajhwo...",,,START Primary Collection,-9,-9,0,-9,


How to request rows by their position (by counting rows)?

In [42]:
df.head(4)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,


In [47]:
# the first 3 rows
df.iloc[:3]  # rows are selected by position; the end position 3 is excluded

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,


How to request rows by their index label?

In [49]:
# all rows up to and including the row with the index label 3
df.loc[:3]  # 3 is treated as a label, so that row is included

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
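The difference between the two selectors is easiest to see side by side. The following is a small sketch on a made-up frame with the default integer index, so the same number behaves differently in `iloc` and `loc`.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Athens", "Rome", "Cairo", "Manila", "Boston"]})

by_position = toy.iloc[:3]   # positions 0, 1, 2; the end position is excluded
by_label = toy.loc[:3]       # labels 0, 1, 2, 3; the end label is included
single_row = toy.iloc[3]     # exactly one row by position, returned as a Series
single_label = toy.loc[3]    # the same row requested by its label

print(len(by_position), len(by_label))
print(single_row["city"])
```

With the default index, position and label coincide for a single row, but the slices differ in length by one.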


Look only at the unique values of a column.

In [52]:
list(df['city'].unique())

['Santo Domingo',
 'Mexico city',
 'Unknown',
 'Athens',
 'Fukouka',
 'Cairo',
 'Montevideo',
 'Oakland',
 'Madison',
 'Baraboo',
 'Denver',
 'Rome',
 'Detroit',
 'Rio Piedras',
 'Berlin',
 'New York City',
 'Rio Grande',
 'Seattle',
 'Champaign',
 'Jersey City',
 'Guatemala City',
 'Quezon City',
 'Caracas',
 'South Sioux City',
 'West Point',
 'Norwalk',
 'Coral Gables',
 'Bamban',
 'Portland',
 'Akron',
 'Dorado',
 'Carolina',
 'Boston',
 'Whitewater',
 'Batavia',
 'Munich',
 'Ypsilanti',
 'Berkeley',
 'Eugene',
 'San Francisco',
 'Buckeystown',
 'Covington',
 'Cleveland',
 'Vallejo',
 'Hartford',
 'Frankfurt',
 'Zurich',
 'Ithaca',
 'Prairie du Sac',
 'Tucson',
 'Boulder',
 'Hebron',
 'Manila',
 'Colorado Springs',
 'Martinez',
 'San Juan',
 'Ashville',
 'Bridgeport',
 'Albuquerque',
 'Bel Air',
 'Cambridge',
 'Sao Paulo',
 'Chicago',
 'Appleton',
 'Alexandria',
 'Long Beach',
 'Billings',
 'San Bernardino',
 'Los Angeles',
 'Lockland',
 'Washington',
 'Orlando',
 'Angeles',
 'Ituz

How many unique values are there in the ```city``` column? In other words, for how many cities does this data set hold information on terrorist attacks?

In [53]:
df['city'].nunique()

36674
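A short sketch on made-up values shows how `unique` and `nunique` treat missing entries. Note that a placeholder string such as 'Unknown' counts as an ordinary value, which is worth keeping in mind when reading the count above.

```python
import pandas as pd

cities = pd.Series(["Rome", "Athens", "Rome", None, "Unknown"])

distinct = cities.unique()                  # distinct values, the missing one included
n_distinct = cities.nunique()               # missing values are not counted by default
n_with_nan = cities.nunique(dropna=False)   # count the missing value as its own category

print(distinct)
print(n_distinct, n_with_nan)
```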

In which years did the largest number of terrorist attacks occur (according to this data set only)?

In [56]:
df['iyear'].value_counts().head(5)

2014    16903
2015    14965
2016    13587
2013    12036
2017    10900
Name: iyear, dtype: int64

In [60]:
df['iyear'].value_counts()[:5]

2014    16903
2015    14965
2016    13587
2013    12036
2017    10900
Name: iyear, dtype: int64
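Both calls above rely on `value_counts` returning the counts sorted from most to least frequent. The sketch below, on made-up years, shows a few related options that are part of the pandas API: relative shares, chronological order, and the single most frequent label.

```python
import pandas as pd

years = pd.Series([2014, 2014, 2014, 2015, 2015, 2016])

counts = years.value_counts()                 # sorted by frequency, descending
shares = years.value_counts(normalize=True)   # relative frequencies that sum to 1
by_year = counts.sort_index()                 # reorder chronologically
top_year = counts.idxmax()                    # label with the highest count

print(counts)
print(shares)
print(top_year)
```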

How can we sort all the data by year? Sorting is ascending by default; pass ```ascending=False``` for descending order.

In [65]:
df['iyear'].sort_values()

0         1970
430       1970
431       1970
432       1970
433       1970
          ... 
174425    2017
174426    2017
174427    2017
174419    2017
181690    2017
Name: iyear, Length: 181691, dtype: int64

In [64]:
df.sort_values(by='iyear', ascending=False)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
181690,201712310032,2017,12,31,,0,,160,Philippines,5,...,,"""Security tightened in Cotabato following IED ...","""Security tightened in Cotabato City,"" Manila ...",,START Primary Collection,-9,-9,0,-9,
174420,201705020030,2017,5,2,"April 29-May 4, 2017",0,,4,Afghanistan,6,...,,"""Afghan forces foiled 3 explosions in less tha...",,,START Primary Collection,-9,-9,1,1,
174428,201705030001,2017,5,2,,0,,95,Iraq,10,...,Casualty numbers conflict across sources. Foll...,"""10 soldiers killed in ISIL attack in W. Iraq,...","""2 Police Officers, 1 Soldier Die in Islamic S...","""Iraq: Security Roundup 1900 GMT 04 May 2016,""...",START Primary Collection,0,1,0,1,
174427,201705020037,2017,5,2,,0,,45,Colombia,3,...,,"""One dead and three injured leaves attack in r...","""Colombia Guerrilla Update: ELN Rebels Ambush ...",,START Primary Collection,0,0,0,0,
174426,201705020036,2017,5,2,,0,,69,France,8,...,,"""A new clandestine group claims attacks in Cor...","""Credit Agricole attack in Biguglia: the bank ...","""Upper Corsica: a bank targeted by a gas cylin...",START Primary Collection,0,0,0,0,"201705020035, 201705020036"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
434,197007270005,1970,7,27,,0,,217,United States,1,...,Part of a multiple attack with 197007270004. ...,"""Army Target at Coast Blasts,"" New York Times,...","""Bomb Breaks Glass In Police Building,"" Modest...","""Guerrilla Acts of Sabotage and Terrorism in t...",Hewitt Project,-9,-9,0,-9,"197007270004, 197007270005"
433,197007270004,1970,7,27,,0,,217,United States,1,...,Part of a multiple attack with 197007270005. ...,"""N.Y. Bank Damaged by Pipe Bomb,"" New York Tim...","""Army Target at Coast Blasts,"" New York Times,...","""Bomb Breaks Glass In Police Building,"" Modest...",Hewitt Project,-9,-9,0,-9,"197007270004, 197007270005"
432,197007270003,1970,7,26,,0,,217,United States,1,...,,,,,PGIS,0,0,0,0,
431,197007270002,1970,7,26,,0,,217,United States,1,...,,,,,PGIS,0,0,0,0,
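The sorting options can be checked on a small made-up frame. This is only a sketch; the column names mirror the data set, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "iyear": [1970, 2017, 1999, 2017],
    "imonth": [7, 5, 1, 12],
})

ascending = toy["iyear"].sort_values()                      # default: smallest first
descending = toy.sort_values(by="iyear", ascending=False)   # whole frame, newest first
# Several keys: newest year first, and within a year the latest month first
nested = toy.sort_values(by=["iyear", "imonth"], ascending=[False, False])

print(ascending.tolist())
print(nested)
```

Note that `sort_values` returns a new object and keeps the original index labels, which is why the index in the output above is no longer in order.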


Which data type does each column have?

In [70]:
dict(df.dtypes)

{'eventid': dtype('int64'),
 'iyear': dtype('int64'),
 'imonth': dtype('int64'),
 'iday': dtype('int64'),
 'approxdate': dtype('O'),
 'extended': dtype('int64'),
 'resolution': dtype('O'),
 'country': dtype('int64'),
 'country_txt': dtype('O'),
 'region': dtype('int64'),
 'region_txt': dtype('O'),
 'provstate': dtype('O'),
 'city': dtype('O'),
 'latitude': dtype('float64'),
 'longitude': dtype('float64'),
 'specificity': dtype('float64'),
 'vicinity': dtype('int64'),
 'location': dtype('O'),
 'summary': dtype('O'),
 'crit1': dtype('int64'),
 'crit2': dtype('int64'),
 'crit3': dtype('int64'),
 'doubtterr': dtype('float64'),
 'alternative': dtype('float64'),
 'alternative_txt': dtype('O'),
 'multiple': dtype('float64'),
 'success': dtype('int64'),
 'suicide': dtype('int64'),
 'attacktype1': dtype('int64'),
 'attacktype1_txt': dtype('O'),
 'attacktype2': dtype('float64'),
 'attacktype2_txt': dtype('O'),
 'attacktype3': dtype('float64'),
 'attacktype3_txt': dtype('O'),
 'targtype1': dtype(

How to check the missing values?

In [72]:
df

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,,0,,182,Somalia,11,...,,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,
181687,201712310029,2017,12,31,,0,,200,Syria,10,...,,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,
181688,201712310030,2017,12,31,,0,,160,Philippines,5,...,,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,
181689,201712310031,2017,12,31,,0,,92,India,6,...,,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,


In [71]:
df.isna()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
1,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
2,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
3,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
4,False,False,False,False,True,False,True,False,False,False,...,True,True,True,True,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,False,False,False,False,True,False,True,False,False,False,...,True,False,False,False,False,False,False,False,False,True
181687,False,False,False,False,True,False,True,False,False,False,...,True,False,False,False,False,False,False,False,False,True
181688,False,False,False,False,True,False,True,False,False,False,...,True,False,True,True,False,False,False,False,False,True
181689,False,False,False,False,True,False,True,False,False,False,...,True,False,True,True,False,False,False,False,False,True


In [74]:
dict(df.isna().sum())

{'eventid': 0,
 'iyear': 0,
 'imonth': 0,
 'iday': 0,
 'approxdate': 172452,
 'extended': 0,
 'resolution': 179471,
 'country': 0,
 'country_txt': 0,
 'region': 0,
 'region_txt': 0,
 'provstate': 421,
 'city': 434,
 'latitude': 4556,
 'longitude': 4557,
 'specificity': 6,
 'vicinity': 0,
 'location': 126196,
 'summary': 66129,
 'crit1': 0,
 'crit2': 0,
 'crit3': 0,
 'doubtterr': 1,
 'alternative': 152680,
 'alternative_txt': 152680,
 'multiple': 1,
 'success': 0,
 'suicide': 0,
 'attacktype1': 0,
 'attacktype1_txt': 0,
 'attacktype2': 175377,
 'attacktype2_txt': 175377,
 'attacktype3': 181263,
 'attacktype3_txt': 181263,
 'targtype1': 0,
 'targtype1_txt': 0,
 'targsubtype1': 10373,
 'targsubtype1_txt': 10373,
 'corp1': 42550,
 'target1': 636,
 'natlty1': 1559,
 'natlty1_txt': 1559,
 'targtype2': 170547,
 'targtype2_txt': 170547,
 'targsubtype2': 171006,
 'targsubtype2_txt': 171006,
 'corp2': 171574,
 'target2': 170671,
 'natlty2': 170863,
 'natlty2_txt': 170863,
 'targtype3': 180515,
 

In [77]:
df.dropna(axis=1)

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,country_txt,region,region_txt,vicinity,...,gname,individual,weaptype1,weaptype1_txt,property,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
0,197000000001,1970,7,2,0,58,Dominican Republic,2,Central America & Caribbean,0,...,MANO-D,0,13,Unknown,0,PGIS,0,0,0,0
1,197000000002,1970,0,0,0,130,Mexico,1,North America,0,...,23rd of September Communist League,0,13,Unknown,0,PGIS,0,1,1,1
2,197001000001,1970,1,0,0,160,Philippines,5,Southeast Asia,0,...,Unknown,0,13,Unknown,0,PGIS,-9,-9,1,1
3,197001000002,1970,1,0,0,78,Greece,8,Western Europe,0,...,Unknown,0,6,Explosives,1,PGIS,-9,-9,1,1
4,197001000003,1970,1,0,0,101,Japan,4,East Asia,0,...,Unknown,0,8,Incendiary,1,PGIS,-9,-9,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,0,182,Somalia,11,Sub-Saharan Africa,0,...,Al-Shabaab,0,5,Firearms,-9,START Primary Collection,0,0,0,0
181687,201712310029,2017,12,31,0,200,Syria,10,Middle East & North Africa,1,...,Muslim extremists,0,6,Explosives,1,START Primary Collection,-9,-9,1,1
181688,201712310030,2017,12,31,0,160,Philippines,5,Southeast Asia,0,...,Bangsamoro Islamic Freedom Movement (BIFM),0,8,Incendiary,1,START Primary Collection,0,0,0,0
181689,201712310031,2017,12,31,0,92,India,6,South Asia,0,...,Unknown,0,6,Explosives,-9,START Primary Collection,-9,-9,0,-9
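The three steps above (a boolean mask, counts per column, dropping incomplete columns) can be followed on a small made-up frame. The sketch also adds `isna().mean()`, which gives the share of missing values per column and is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "city": ["Athens", None, "Rome", "Cairo"],
    "nkill": [1.0, np.nan, np.nan, 0.0],
    "year": [1970, 1970, 1971, 1972],
})

missing_counts = toy.isna().sum()       # number of missing values per column
missing_share = toy.isna().mean()       # fraction of missing values per column
complete_columns = toy.dropna(axis=1)   # keep only columns without any missing value

print(missing_counts)
print(missing_share)
print(complete_columns.columns.tolist())
```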


In [80]:
df.head(5)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


In [88]:
df['attacktype2'].min()

1.0

In [89]:
df['attacktype2'].max()

9.0

In [90]:
df['attacktype2'].mode()

0    2.0
dtype: float64

In [91]:
df['attacktype2'].median()

2.0

In [92]:
df['attacktype2'].mean()

3.7195121951219514

In [93]:
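# Pitfall: mode() returns a Series (with index 0), so fillna aligns on the index
# and fills only row 0, as the output below shows. Use .mode()[0] to fill every NaN.
df['attacktype2'].fillna(df['attacktype2'].mode())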

0         2.0
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
181686    NaN
181687    NaN
181688    NaN
181689    NaN
181690    NaN
Name: attacktype2, Length: 181691, dtype: float64
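The output above still contains NaN because `mode()` returns a Series and `fillna` aligns that Series on the index. The sketch below reproduces the effect on made-up values and shows one way around it, passing the first mode as a scalar.

```python
import numpy as np
import pandas as pd

attack = pd.Series([np.nan, 2.0, 2.0, 3.0, np.nan])

mode_series = attack.mode()                   # a Series, here with the single label 0
aligned = attack.fillna(mode_series)          # aligns on labels: only row 0 is filled
filled = attack.fillna(mode_series.iloc[0])   # scalar: every NaN is filled

print(aligned.tolist())
print(filled.tolist())
```

Whether the mode is a sensible replacement value depends on the column; the sketch only demonstrates the mechanics.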

Let's delete the ```approxdate``` column from this data set, because it contains a lot of missing values:

In [None]:
df.drop(['approxdate'], axis=1, inplace=True)

Create a new variable ```casualties``` by summing the number of killed and wounded. Look up the column names first; the relevant columns are ```nkill``` and ```nwound```.

In [98]:
set(df.columns)

{'INT_ANY',
 'INT_IDEO',
 'INT_LOG',
 'INT_MISC',
 'addnotes',
 'alternative',
 'alternative_txt',
 'approxdate',
 'attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt',
 'city',
 'claim2',
 'claim3',
 'claimed',
 'claimmode',
 'claimmode2',
 'claimmode2_txt',
 'claimmode3',
 'claimmode3_txt',
 'claimmode_txt',
 'compclaim',
 'corp1',
 'corp2',
 'corp3',
 'country',
 'country_txt',
 'crit1',
 'crit2',
 'crit3',
 'dbsource',
 'divert',
 'doubtterr',
 'eventid',
 'extended',
 'gname',
 'gname2',
 'gname3',
 'gsubname',
 'gsubname2',
 'gsubname3',
 'guncertain1',
 'guncertain2',
 'guncertain3',
 'hostkidoutcome',
 'hostkidoutcome_txt',
 'iday',
 'imonth',
 'individual',
 'ishostkid',
 'iyear',
 'kidhijcountry',
 'latitude',
 'location',
 'longitude',
 'motive',
 'multiple',
 'natlty1',
 'natlty1_txt',
 'natlty2',
 'natlty2_txt',
 'natlty3',
 'natlty3_txt',
 'ndays',
 'nhostkid',
 'nhostkidus',
 'nhours',
 'nkill',
 'nkillter',
 'nkillus'

In [104]:
df['casualties'] = df['nwound'] + df['nkill']

In [105]:
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,PGIS,-9,-9,1,1,,
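Some rows above have an empty ```casualties``` value because plain addition returns NaN as soon as one of the two operands is missing. The sketch below shows this on made-up values, together with `Series.add(..., fill_value=0)` as one possible alternative. Treating a missing count as zero is an assumption about the data, not something the data set itself states.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "nkill": [1.0, 0.0, np.nan, np.nan],
    "nwound": [0.0, np.nan, 2.0, np.nan],
})

strict = toy["nwound"] + toy["nkill"]   # NaN if either value is missing
# fill_value=0 treats a single missing value as 0;
# the result is still NaN when both values are missing
lenient = toy["nwound"].add(toy["nkill"], fill_value=0)

print(strict.tolist())
print(lenient.tolist())
```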


Rename the ```iyear``` column to ```Year```:

In [108]:
df.rename({'iyear' : 'Year'}, axis='columns', inplace=True)

In [110]:
df

Unnamed: 0,eventid,Year,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,PGIS,-9,-9,1,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,,0,,182,Somalia,11,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,,0,,200,Syria,10,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,,0,,160,Philippines,5,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,,0,,92,India,6,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0


How to drop all missing values? Or replace them with other values?

In [None]:
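# Caution: this drops every row that contains at least one NaN. Several columns here
# are almost entirely NaN, so very few rows (possibly none) would remain.
df.dropna(inplace=True)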

**Task!** Use a function to replace NaNs (= missing values) with the string 'None' in the ```related``` column.

In [None]:
# TODO

For selected columns, show the mean, median and/or mode.

In [111]:
df['Year'].mean()

Min, max and sum:

In [116]:
df['Year'].sum()

363861482

In [117]:
sum(df['Year'])

363861482

In [121]:
max('word')

'w'
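The cells above mix pandas methods with Python built-ins. The sketch below, on made-up years, shows that both routes give the same sum, and that the built-in `max` works on any iterable, which is why `max('word')` returns a single character.

```python
import pandas as pd

years = pd.Series([1970, 2017, 1999])

via_method = years.sum()    # vectorised pandas method
via_builtin = sum(years)    # Python built-in: iterates element by element
largest = years.max()
smallest = years.min()
# The built-in max compares the characters of a string one by one
last_letter = max("word")

print(via_method, via_builtin, smallest, largest, last_letter)
```

On large columns the pandas methods are usually the better choice, since they avoid a Python-level loop.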

Filter the data set to look only at the attacks after 2015:

In [123]:
df[df.Year > 2015]

Unnamed: 0,eventid,Year,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
157202,201601010003,2016,1,1,2016-01-01 00:00:00,0,,95,Iraq,10,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,7.0
157203,201601010004,2016,1,1,,0,,95,Iraq,10,...,"""Iraq: Roundup of Security Incidents 29 Decemb...",,,START Primary Collection,-9,-9,0,-9,,9.0
157204,201601010005,2016,1,1,2016-01-01 00:00:00,0,,95,Iraq,10,...,"""Iraq: Roundup of Security Incidents 29 Decemb...","""Terrorism: Transcript of ISIL's Al-Bayan Radi...",,START Primary Collection,0,1,0,1,,5.0
157205,201601010008,2016,1,1,,0,,92,India,6,...,"""Terrorists abduct SP, friends in Punjab,"" Hin...","""Punjab on alert after cop thrashed by men in ...","""Alert in Punjab after SP's abduction,"" India ...",START Primary Collection,-9,-9,0,-9,"201601010008, 201601010009",3.0
157206,201601010009,2016,1,1,,0,,92,India,6,...,"""Terrorists abduct SP, friends in Punjab,"" Hin...","""Punjab on alert after cop thrashed by men in ...","""Alert in Punjab after SP's abduction,"" India ...",START Primary Collection,-9,-9,0,-9,"201601010008, 201601010009",1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181686,201712310022,2017,12,31,,0,,182,Somalia,11,...,"""Somalia: Al-Shabaab Militants Attack Army Che...","""Highlights: Somalia Daily Media Highlights 2 ...","""Highlights: Somalia Daily Media Highlights 1 ...",START Primary Collection,0,0,0,0,,3.0
181687,201712310029,2017,12,31,,0,,200,Syria,10,...,"""Putin's 'victory' in Syria has turned into a ...","""Two Russian soldiers killed at Hmeymim base i...","""Two Russian servicemen killed in Syria mortar...",START Primary Collection,-9,-9,1,1,,9.0
181688,201712310030,2017,12,31,,0,,160,Philippines,5,...,"""Maguindanao clashes trap tribe members,"" Phil...",,,START Primary Collection,0,0,0,0,,0.0
181689,201712310031,2017,12,31,,0,,92,India,6,...,"""Trader escapes grenade attack in Imphal,"" Bus...",,,START Primary Collection,-9,-9,0,-9,,0.0


What if we have several conditions? Wrap each condition in parentheses and combine them with ```&``` (and) or ```|``` (or). Try it out:

In [124]:
df[(df.Year > 2015) & (df.extended == 1)]

Unnamed: 0,eventid,Year,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,casualties
157223,201601010029,2016,1,1,,1,,4,Afghanistan,6,...,"""Afghanistan: Two of Eight Kidnap Victims Foun...",,,START Primary Collection,0,0,0,0,,2.0
157228,201601010036,2016,1,1,,1,,195,Sudan,11,...,"""Highlights: Sudan Daily Media Highlights 14 J...",,,START Primary Collection,0,0,0,0,,28.0
157232,201601010042,2016,1,1,,1,,92,India,6,...,"""Tura trader abducted,"" Assam Tribune, January...","""One person injured in abduction bid in Meghal...",,START Primary Collection,0,0,0,0,,
157239,201601020006,2016,1,2,,1,,92,India,6,...,"""India air base attack: At least 8 dead,"" CNN,...","""Gunfire, blasts at Indian air base, two milit...","""Pakistan arrests Jaish members in connection ...",START Primary Collection,0,1,0,1,,32.0
157257,201601020043,2016,1,2,,1,,217,United States,1,...,"""Protesters occupy Oregon wildlife refuge as d...","""Oregon standoff: 4 holdouts all in FBI custod...","""Federal indictment says protesters in Oregon ...",START Primary Collection,0,0,0,0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181620,201712280040,2017,12,28,,1,,4,Afghanistan,6,...,"""4 Two Students And A Teacher Rescued From Kid...",,,START Primary Collection,-9,-9,0,-9,,0.0
181630,201712290008,2017,12,28,,1,,160,Philippines,5,...,"""NPA abducts police officer in Cotabato,"" Mani...","""Police official taken by armed men in North C...","""Philippines: Suspected communist rebels abduc...",START Primary Collection,0,0,0,0,,
181636,201712290017,2017,12,29,,1,,4,Afghanistan,6,...,"""Afghanistan- ED: Deash's threat more serious ...","""ISIS Abducted 12 Chaplains in Northern Jawzja...",,START Primary Collection,0,1,0,1,,
181655,201712300013,2017,12,30,,1,,147,Nigeria,11,...,"""5 soldiers reported killed in Boko Haram atta...","""Nigeria Scores Killed in Boko Haram Attacks i...","""BBCM Terrorism Digest: 1-2 January 2018,"" BBC...",START Primary Collection,0,0,0,0,,
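The filtering pattern generalises to other logical operators. The following sketch uses a made-up frame whose column names mirror the data set; `isin` and `query` are standard pandas methods, shown here as optional alternatives.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2014, 2016, 2017, 2017],
    "extended": [1, 0, 1, 0],
    "country_txt": ["Iraq", "India", "Iraq", "Italy"],
})

both = toy[(toy["Year"] > 2015) & (toy["extended"] == 1)]     # AND
either = toy[(toy["Year"] > 2015) | (toy["extended"] == 1)]   # OR
not_extended = toy[~(toy["extended"] == 1)]                   # NOT
selected = toy[toy["country_txt"].isin(["Iraq", "Italy"])]    # membership test
same_as_both = toy.query("Year > 2015 and extended == 1")     # string form of the AND filter

print(both)
print(len(either), len(not_extended), len(selected))
```

The parentheses around each condition are required because `&` and `|` bind more tightly than comparison operators in Python.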


Additional materials:

* https://www.kaggle.com/START-UMD/gtd/code?datasetId=504&sortBy=voteCount