# Exploratory Data Analysis with Python
<div style="
    border: 5px solid purple;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [1]:
import pandas as pd

In [None]:
#!pip install pandas

<div style="
    border: 3px solid purple;
    border-radius: 8px;
    padding: 12px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
     Your job is to make sense of any dataset you are given and write a preliminary report.
    <ul>
      <li>What is the structure of the data?</li>
      <li>How clean is the dataset?</li>
      <li>Does it look real or machine generated?</li>
      <li>Is it worth analysing further?</li>
      <li>Are there any interesting insights that can already be drawn?</li>
    </ul>
</div>

## The basics - Understanding a dataframe
<div style="
    border: 4px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

<div style="
    border: 3px solid orange;
    border-radius: 8px;
    padding: 12px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
The pandas documentation describes a DataFrame as "Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
</div>
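The definition above can be made concrete with a small sketch. The two Series and their values are toy data invented for illustration, not part of the tutorial datasets. The sketch shows that each column is a Series, that arithmetic aligns on labels, and that columns can hold different dtypes and be added or removed.

```python
import pandas as pd

# Toy data (hypothetical): two Series with partly overlapping labels
scores = pd.Series([39, 34, 40], index=["a", "b", "c"], name="scores")
bonus = pd.Series([1, 2, 3], index=["b", "c", "d"], name="bonus")

# A DataFrame behaves like a dict of Series that share one index
toy_df = pd.DataFrame({"scores": scores, "bonus": bonus})

# Each column is itself a Series
col_type = type(toy_df["scores"])

# Arithmetic aligns on labels: labels without a partner become NaN
total = scores + bonus

# Heterogeneous: every column can have its own dtype (here a boolean one)
toy_df["passed"] = toy_df["scores"] > 35

# Size-mutable: columns can be added (above) or removed
toy_df = toy_df.drop(columns="bonus")
print(toy_df)
print(total)
```

Labels "a" and "d" appear in only one of the two Series, so their sum is missing rather than silently wrong.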

### Building a dataframe from a dictionary
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [2]:
mydict = {
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
    "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"]
}

In [3]:
#pandas library
df = pd.DataFrame(mydict)

In [4]:
df

Unnamed: 0,names,scores,fav_food
0,Gustavo,39,tacos
1,Henrik,34,pasta
2,Wanja,40,cake
3,Carlo,49,döner
4,Jannik,10,ice cream


### Importing data
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#from a csv file
df = pd.read_csv("datasets/socialmedia_engagement.csv")
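Real CSV files often need a few extra arguments. The following is a minimal sketch that assumes a semicolon-separated file in which missing values are written as `n.a.`; the content, column names, and the `toy_posts` variable are hypothetical and only stand in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
raw_csv = io.StringIO(
    "post_id;likes;platform\n"
    "1;120;instagram\n"
    "2;n.a.;tiktok\n"
    "3;45;instagram\n"
)

# sep sets the delimiter; na_values lists extra strings to read as missing
toy_posts = pd.read_csv(raw_csv, sep=";", na_values=["n.a."])
print(toy_posts)
print(toy_posts.dtypes)
```

Without `na_values`, the `likes` column would be read as text because of the `n.a.` entry.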

In [None]:
#from an excel file (requires the openpyxl dependency to be installed)
df = pd.read_excel("datasets/happiness_2015-2019.xlsx")

In [None]:
#from github
username = "datagus"
repository = "statstutorial2025"
directory = "week5/airbnb_europe.csv"
github_url = f"https://raw.githubusercontent.com/{username}/{repository}/main/{directory}"
df = pd.read_csv(github_url)

In [6]:
#from a google spreadsheet
gsheet_id = "1wEGvOk504_wnFlv1D9Dw8IFIAaDMtwau"
url = f"https://docs.google.com/spreadsheets/d/{gsheet_id}/export?format=xlsx"
excel = pd.ExcelFile(url)
df = excel.parse("master table")

In [9]:
excel.sheet_names

['master table',
 'Tabellenblatt10',
 'read.me',
 'deleted_rows',
 'missing_articlePDFs_for_Paris']

In [None]:
!pip install openpyxl

### Inspecting the structure of a dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [12]:
df

Unnamed: 0,Downloaded (1/0),Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,...,"spatial scale (individual, local, regional, national, supranational, global)",snapshot in time vs. longitudinal study,"temporal scale (past, present, future)",qualitative/quantitative/mixed methods,location of the study (country),Dataset used (inductive),country of the institution of the first author,Is there an own subsection for policy recommendations? 1/0,Notes,Coding complete
0,1,Thiebes S.; Lins S.; Sunyaev A.,"Thiebes, Scott (56319399400); Lins, Sebastian ...",56319399400; 56318996100; 24779131200,Trustworthy artificial intelligence,2021,Electronic Markets,31.0,2,,...,,,,,,,Germany,0.0,,
1,1,Ho J.-H.; Lee G.-G.; Lu M.-T.,"Ho, Juin-Hao (57218510909); Lee, Gwo-Guang (74...",57218510909; 7404852393; 55801461400,Exploring the implementation of a legal AI bot...,2020,Sustainability (Switzerland),12.0,15,5991,...,national,snaphot,present,quantitative,Taiwan,survey,Taiwan,0.0,,
2,1,Bartmann M.,"Bartmann, Marius (56512092600)",56512092600,The Ethics of AI-Powered Climate Nudging—How M...,2022,Sustainability (Switzerland),14.0,9,5153,...,,,,,,,Germany,0.0,,
3,1,Halsband A.,"Halsband, Aurélie (57562370000)",57562370000,Sustainable AI and Intergenerational Justice,2022,Sustainability (Switzerland),14.0,7,3922,...,,,,,,,Germany,0.0,,
4,1,Raman R.; Kumar Nair V.; Nedungadi P.; Ray I.;...,"Raman, Raghu (36618183700); Kumar Nair, Vinith...",36618183700; 57647914700; 36069838600; 5883060...,"Darkweb research: Past, present, and future tr...",2023,Heliyon,9.0,11,e22269,...,,,,mixed methods,,,India,1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,1,Zhang Y.; Ji Y.; Qian H.,"Zhang, Yang (57471054500); Ji, Yuanhui (572047...",57471054500; 57204776640; 55186013100,Progress in thermodynamic simulation and syste...,2021,Green Chemical Engineering,2.0,3,,...,,,,,,,China,0.0,,
888,1,Jang J.; Kyun S.,"Jang, Jiyoung; Kyun, Suna",,An Innovative Career Management Platform Empow...,2022,"Journal of Logistics, Informatics and Service ...",9.0,1,,...,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,South Korea,0.0,,
889,1,Kabukye J.K.; Namugga J.; Mpamani C.J.; Katumb...,"Kabukye, Johnblack K.; Namugga, Jane; Mpamani,...",57205140187; 57201368167; 57403715300; 3623907...,Implementing Smartphone-Based Telemedicine for...,2023,Journal of Medical Internet Research,25.0,1,e45132,...,regional,snapshot in time,present,qualitative,Uganda,n.a.,Sweden; Uganda,0.0,,
890,1,Deng M.; Liu Y.; Chen L.,"Deng, Meizhen; Liu, Yimeng; Chen, Ling",58606568400; 58605588800; 57700546300,AI-driven innovation in ethnic clothing design...,2023,Electronic Research Archive,31.0,9,,...,local,snapshot in time,present,mixed methods,Biasha,n.a.,China,0.0,,


In [13]:
#how many columns and rows
df.shape

(892, 47)

In [15]:
#retrieving the number of columns separately and printing it
colnums = df.shape[1]
print(f"the number of columns is {colnums}")

the number of columns is 47


In [16]:
#another way of getting the number of rows, using len()
len(df)

892
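Because `shape` is a plain tuple of (rows, columns), both numbers can be unpacked in one step. A short sketch on a toy dataframe (hypothetical values):

```python
import pandas as pd

# Toy dataframe (hypothetical) with 3 rows and 2 columns
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Unpack the (rows, columns) tuple in one line
n_rows, n_cols = toy_df.shape
print(f"the dataframe has {n_rows} rows and {n_cols} columns")
```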

In [17]:
#a more detailed overview: column names, non-null counts and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 47 columns):
 #   Column                                                                        Non-Null Count  Dtype  
---  ------                                                                        --------------  -----  
 0   Downloaded (1/0)                                                              764 non-null    object 
 1   Authors                                                                       892 non-null    object 
 2   Author full names                                                             880 non-null    object 
 3   Author(s) ID                                                                  879 non-null    object 
 4   Title                                                                         891 non-null    object 
 5   Year                                                                          892 non-null    int64  
 6   Source title                      

In [18]:
pd.set_option('display.max_columns', None) # to show all columns
#pd.reset_option('display.max_columns')
#checking the first 5 rows
df.head(5)

Unnamed: 0,Downloaded (1/0),Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Author Keywords,Index Keywords,Abstract,Document Type,Publication Stage,Open Access,Source,EID,SDG,AI (yes/no),Sustainability (yes/no),"Type of AI \ninductive, text snippet","Algorithm(s) used\ninductive, text snippet",Method (1) vs. study object (2),AI as buzzword? (0/1),"Core topic (only one)\nshort text snippet, inductive",Role of AI\ndeductive,Means (1) vs. end (2),sustainability definition,Sus_lvl,empirical/conceptual/review,"spatial scale (individual, local, regional, national, supranational, global)",snapshot in time vs. longitudinal study,"temporal scale (past, present, future)",qualitative/quantitative/mixed methods,location of the study (country),Dataset used (inductive),country of the institution of the first author,Is there an own subsection for policy recommendations? 1/0,Notes,Coding complete
0,1,Thiebes S.; Lins S.; Sunyaev A.,"Thiebes, Scott (56319399400); Lins, Sebastian ...",56319399400; 56318996100; 24779131200,Trustworthy artificial intelligence,2021,Electronic Markets,31.0,2,,447.0,464.0,17.0,167,10.1007/s12525-020-00441-4,https://www.scopus.com/inward/record.uri?eid=2...,artificial intelligence; deep learning; emotio...,,Artificial intelligence (AI) brings forth many...,Article,Final,,Scopus,2-s2.0-85148853990,16,1,1,overall,unspecified,2,0,Societal impact of AI,,2,,weak sustainability,conceptual,,,,,,,Germany,0.0,,
1,1,Ho J.-H.; Lee G.-G.; Lu M.-T.,"Ho, Juin-Hao (57218510909); Lee, Gwo-Guang (74...",57218510909; 7404852393; 55801461400,Exploring the implementation of a legal AI bot...,2020,Sustainability (Switzerland),12.0,15,5991,,,,8,10.3390/su12155991,https://www.scopus.com/inward/record.uri?eid=2...,business; digital; education; entrepreneur; In...,,This study explores the implementation of lega...,Article,Final,,Scopus,2-s2.0-85168710066,16,1,1,overall,unspecified,2,0,legal AI bot,,1,,weak sustainability,empirical,national,snaphot,present,quantitative,Taiwan,survey,Taiwan,0.0,,
2,1,Bartmann M.,"Bartmann, Marius (56512092600)",56512092600,The Ethics of AI-Powered Climate Nudging—How M...,2022,Sustainability (Switzerland),14.0,9,5153,,,,6,10.3390/su14095153,https://www.scopus.com/inward/record.uri?eid=2...,Environmental effects; Green transition; Miner...,Asia; Economic and social effects; Economics; ...,The number of areas in which artificial intell...,Article,Final,,Scopus,2-s2.0-85174445734,16,1,1,overall,unspecified,2,0,ethics of AI-based climate nudging,,1,,weak sustainability,conceptual,,,,,,,Germany,0.0,,
3,1,Halsband A.,"Halsband, Aurélie (57562370000)",57562370000,Sustainable AI and Intergenerational Justice,2022,Sustainability (Switzerland),14.0,7,3922,,,,6,10.3390/su14073922,https://www.scopus.com/inward/record.uri?eid=2...,SDGs; ecological sustainability; intergenerati...,,"Recently, attention has been drawn to the sust...",Article,Final,,Scopus,2-s2.0-85139602753,16,1,1,overall,unspecified,2,0,intergenerational justice,,1,intergenerational justice,strong sustainability,conceptual,,,,,,,Germany,0.0,,
4,1,Raman R.; Kumar Nair V.; Nedungadi P.; Ray I.;...,"Raman, Raghu (36618183700); Kumar Nair, Vinith...",36618183700; 57647914700; 36069838600; 5883060...,"Darkweb research: Past, present, and future tr...",2023,Heliyon,9.0,11,e22269,,,,1,10.1016/j.heliyon.2023.e22269,https://www.scopus.com/inward/record.uri?eid=2...,Artificial Intelligence; Climate Change; Energ...,India; artificial intelligence; climate change...,"The Darkweb, part of the deep web, can be acce...",Article,Final,,Scopus,2-s2.0-85185331385,16,1,1,nlp,unspecified,2,0,dark web,,1,SDG 16,strong sustainability,review,,,,mixed methods,,,India,1.0,,


In [19]:
#checking the last 5 rows
df.tail(5)

Unnamed: 0,Downloaded (1/0),Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Author Keywords,Index Keywords,Abstract,Document Type,Publication Stage,Open Access,Source,EID,SDG,AI (yes/no),Sustainability (yes/no),"Type of AI \ninductive, text snippet","Algorithm(s) used\ninductive, text snippet",Method (1) vs. study object (2),AI as buzzword? (0/1),"Core topic (only one)\nshort text snippet, inductive",Role of AI\ndeductive,Means (1) vs. end (2),sustainability definition,Sus_lvl,empirical/conceptual/review,"spatial scale (individual, local, regional, national, supranational, global)",snapshot in time vs. longitudinal study,"temporal scale (past, present, future)",qualitative/quantitative/mixed methods,location of the study (country),Dataset used (inductive),country of the institution of the first author,Is there an own subsection for policy recommendations? 1/0,Notes,Coding complete
887,1,Zhang Y.; Ji Y.; Qian H.,"Zhang, Yang (57471054500); Ji, Yuanhui (572047...",57471054500; 57204776640; 55186013100,Progress in thermodynamic simulation and syste...,2021,Green Chemical Engineering,2.0,3,,266.0,283.0,17.0,26,10.1016/j.gce.2021.06.003,https://www.scopus.com/inward/record.uri?eid=2...,nil,nil,"Due to the shortage of fossil energy, biomass ...",Review,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-85126637055,9,1,1,AI,,2,1,Sustainable utilization of biomass resources,,1,,weak,Review,,,,,,,China,0.0,,
888,1,Jang J.; Kyun S.,"Jang, Jiyoung; Kyun, Suna",,An Innovative Career Management Platform Empow...,2022,"Journal of Logistics, Informatics and Service ...",9.0,1,,274.0,290.0,16.0,7,10.33168/LISS.2022.0117,https://www.scopus.com/inward/record.uri?eid=2...,artificial intelligence; big data; blockchain;...,n.a.,With the advent of the fourth industrial revol...,Article,Final,,Scopus,2-s2.0-85128453304,5,1,1,artificial intelligence; big data;\nblockchain...,n.a.,2,0,customized career management platform for fema...,,1,"buzzword; ""sustainable career management of ta...",weak,conceptual,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,South Korea,0.0,,
889,1,Kabukye J.K.; Namugga J.; Mpamani C.J.; Katumb...,"Kabukye, Johnblack K.; Namugga, Jane; Mpamani,...",57205140187; 57201368167; 57403715300; 3623907...,Implementing Smartphone-Based Telemedicine for...,2023,Journal of Medical Internet Research,25.0,1,e45132,,,,0,10.2196/45132,https://www.scopus.com/inward/record.uri?eid=2...,cervical cancer; cervicography; digital health...,Artificial Intelligence; Early Detection of Ca...,"Background: In Uganda, cervical cancer (CaCx) ...",Article,Final,All Open Access; Gold Open Access; Green Open ...,Scopus,2-s2.0-85174748358,5,1,1,AI; supervised learning,n.a.,1,0,smartphone-based store-and-forward telemedicin...,Forecasting,1,buzzword; longevity,weak,empirical,regional,snapshot in time,present,qualitative,Uganda,n.a.,Sweden; Uganda,0.0,,
890,1,Deng M.; Liu Y.; Chen L.,"Deng, Meizhen; Liu, Yimeng; Chen, Ling",58606568400; 58605588800; 57700546300,AI-driven innovation in ethnic clothing design...,2023,Electronic Research Archive,31.0,9,,5793.0,5814.0,21.0,0,10.3934/era.2023295,https://www.scopus.com/inward/record.uri?eid=2...,Artificial Intelligence; cultural preservation...,n.a.,This study delves into the innovative applicat...,Article,Final,All Open Access; Gold Open Access; Green Open ...,Scopus,2-s2.0-85171645006,5,1,1,AI; unsupervised learning; ML; Natural Langua...,Multimodal Unsupervised Image-to-Image Transla...,2,0,application of Artificial Intelligence (AI) an...,,1,"buzzword; ""sustainable development of ethnic f...",weak,empirical,local,snapshot in time,present,mixed methods,Biasha,n.a.,China,0.0,,
891,1,Zhu X.; Yao Q.; Dai W.; Ji L.; Yao Y.; Pang B....,"Zhu, Xingce; Yao, Qiang; Dai, Wei; Ji, Lu; Yao...",58221931100; 55588525000; 56366749800; 5684438...,Cervical cancer screening aided by artificial ...,2023,Bulletin of the World Health Organization,101.0,6,,381.0,390.0,9.0,0,10.2471/BLT.22.289061,https://www.scopus.com/inward/record.uri?eid=2...,,,Objective To implement and evaluate a large-sc...,Article,Final,All Open Access; Bronze Open Access; Green Ope...,Scopus,2-s2.0-85160969682,5,1,1,Ai,n.a.,1,0,online cervical cancer screening programme usi...,Forecasting,1,buzzword,weak,empirical,regional,snapshot in time,present,quantitative,Hubei Province China,,China,0.0,,


In [20]:
#checking a random slice of the dataframe
df.sample(5)

Unnamed: 0,Downloaded (1/0),Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Author Keywords,Index Keywords,Abstract,Document Type,Publication Stage,Open Access,Source,EID,SDG,AI (yes/no),Sustainability (yes/no),"Type of AI \ninductive, text snippet","Algorithm(s) used\ninductive, text snippet",Method (1) vs. study object (2),AI as buzzword? (0/1),"Core topic (only one)\nshort text snippet, inductive",Role of AI\ndeductive,Means (1) vs. end (2),sustainability definition,Sus_lvl,empirical/conceptual/review,"spatial scale (individual, local, regional, national, supranational, global)",snapshot in time vs. longitudinal study,"temporal scale (past, present, future)",qualitative/quantitative/mixed methods,location of the study (country),Dataset used (inductive),country of the institution of the first author,Is there an own subsection for policy recommendations? 1/0,Notes,Coding complete
268,1,Abba S.I.; Benaafi M.; Usman A.G.; Ozsahin D.U...,"Abba, S.I. (57208942739); Benaafi, Mohammed (5...",57208942739; 56010674400; 57212103604; 3522235...,Mapping of groundwater salinization and modell...,2023,Science of the Total Environment,858.0,,159697.0,,,,9,10.1016/j.scitotenv.2022.159697,https://www.scopus.com/inward/record.uri?eid=2...,The growing increase in groundwater (GW) salin...,Artificial intelligence; Coastal aquifer; Grou...,Algorithms; Artificial Intelligence; Fuzzy Log...,Article,Final,,Scopus,2-s2.0-85141451707,6,1,1,adaptive neuro-fuzzy inference system (ANFIS);...,meta-heuristic algorithms; PSO algorithm,2,0,Mapping of groundwater salinization and modelling,Accelerated experimentation,1,long-term sustainable goals on the national an...,medium,empirical,regional,snapshot,present,quantitative,Saudi Arabia,,Saudi Arabia,0.0,,
565,1,Kabošová L.; Chronis A.; Galanos T.; Kmeť S.; ...,"Kabošová, Lenka (57665887100); Chronis, Angelo...",57665887100; 53463282600; 57194196901; 5592049...,Shape optimization during design for improving...,2022,Building and Environment,226.0,,109668.0,,,,7,10.1016/j.buildenv.2022.109668,https://www.scopus.com/inward/record.uri?eid=2...,This paper delivers an idea of weather-based o...,CFD; Computational fluid dynamics; InFraRed; P...,Kosice; Kosicky; Slovakia; Blending; Environme...,Article,Final,,Scopus,2-s2.0-85139867315,13,Yes,Yes,artificial intelligence,"Computational Fluid Dynamics (CFD), InFraRed",1,0,Developing cities in a climate friendly way wi...,Forecasting,1,0,weak,empirical,local,snapshot in time,present,quantitative,Slovakia,*epw EnergyPlus weather file,Slovakia,0.0,,
222,1,Carboni D.; Gluhak A.; McCann J.A.; Beach T.H.,"Carboni, Davide (57212489903); Gluhak, Alex (8...",57212489903; 8713966800; 7202464158; 36124728000,Contextualising water use in residential setti...,2016,Sensors (Switzerland),16.0,5.0,738.0,,,,26,10.3390/s16050738,https://www.scopus.com/inward/record.uri?eid=2...,Disaggregation algorithms; Machine learning; W...,Artificial intelligence; Data mining; Housing;...,Water monitoring in households is important to...,Review,Final,All Open Access; Gold Open Access; Green Open ...,Scopus,2-s2.0-84969544904,6,1,1,Machine learning; data mining,,1,0,analyse monitored data to obtain non-intrusive...,Data mining and remote sensing,1,,weak,conceptual,local,snapshot,present,,,,United Kingdom,0.0,,
398,1,Sharma Y.; Suri A.; Sijariya R.; Jindal L.,"Sharma, Yogesh (58640029700); Suri, Ankit (578...",58640029700; 57848818600; 57833538600; 5719428...,Role of education 4.0 in innovative curriculum...,2023,E-Learning and Digital Media,,,,,,,1,10.1177/20427530231221073,,artificial intelligence; augmented reality; di...,,The aim of the research is to investigate the ...,Article,Article in press,,Scopus,2-s2.0-85179931460,4,maybe,yes,AI; AR; IoT,,2,1,Role of education 4.0 in innovative curriculum...,,2,,medium,review,global,snapshot,present,quantitative,,,India,1.0,,
358,1,Flores-Vivar J.-M.; García-Peñalvo F.-J.,"Flores-Vivar, Jesús-Miguel (57094205100); Garc...",57094205100; 16031087300,"Reflections on the ethics, potential, and chal...",2023,Comunicar,30.0,74.0,,35.0,44.0,9.0,33,10.3916/C74-2023-03,,Artificial intelligence; Digital literacy; Edu...,,This article analyses and reflects on the ethi...,Article,Final,All Open Access; Gold Open Access; Green Open ...,Scopus,2-s2.0-85146338894,4,yes,yes,AI,,1;2,0,f AI and its capacity for action in the educat...,all,1;2,SDG4,strong,review,global,snapshot,present,qualitative,,,Spain,1.0,,


In [21]:
#checking the column names
df.columns

Index(['Downloaded (1/0)', 'Authors', 'Author full names', 'Author(s) ID',
       'Title', 'Year', 'Source title', 'Volume', 'Issue', 'Art. No.',
       'Page start', 'Page end', 'Page count', 'Cited by', 'DOI', 'Link',
       'Author Keywords', 'Index Keywords', 'Abstract', 'Document Type',
       'Publication Stage', 'Open Access', 'Source', 'EID', 'SDG',
       'AI (yes/no)', 'Sustainability (yes/no)',
       'Type of AI \ninductive, text snippet',
       'Algorithm(s) used\ninductive, text snippet',
       'Method (1) vs. study object (2)', 'AI as buzzword? (0/1)',
       'Core topic (only one)\nshort text snippet, inductive',
       'Role of AI\ndeductive ', 'Means (1) vs. end (2)',
       'sustainability definition', 'Sus_lvl', 'empirical/conceptual/review',
       'spatial scale (individual, local, regional, national, supranational, global)',
       'snapshot in time vs. longitudinal study',
       'temporal scale (past, present, future)',
       'qualitative/quantitative/mixed 

In [22]:
df.columns[5]

'Year'

In [23]:
#getting the index
df.index

RangeIndex(start=0, stop=892, step=1)

In [24]:
#getting some descriptive statistics for the numeric columns
df.describe()

Unnamed: 0,Year,Volume,Page count,Cited by,SDG,Is there an own subsection for policy recommendations? 1/0,Coding complete
count,892.0,860.0,365.0,892.0,892.0,850.0,41.0
mean,2020.880045,92.473256,15.583562,40.318386,8.165919,0.145882,1.0
std,2.754888,229.278177,8.631132,73.286039,4.722542,0.353196,0.0
min,1998.0,1.0,3.0,0.0,1.0,0.0,1.0
25%,2020.0,12.0,10.0,6.0,4.0,0.0,1.0
50%,2022.0,21.0,14.0,19.0,7.0,0.0,1.0
75%,2023.0,72.0,19.0,41.0,12.0,0.0,1.0
max,2024.0,2023.0,68.0,798.0,18.0,1.0,1.0


In [26]:
#getting some descriptive statistics for categories or object data types
df.describe(include="object")

Unnamed: 0,Downloaded (1/0),Authors,Author full names,Author(s) ID,Title,Source title,Issue,Art. No.,Page start,Page end,DOI,Link,Author Keywords,Index Keywords,Abstract,Document Type,Publication Stage,Open Access,Source,EID,AI (yes/no),Sustainability (yes/no),"Type of AI \ninductive, text snippet","Algorithm(s) used\ninductive, text snippet",Method (1) vs. study object (2),AI as buzzword? (0/1),"Core topic (only one)\nshort text snippet, inductive",Role of AI\ndeductive,Means (1) vs. end (2),sustainability definition,Sus_lvl,empirical/conceptual/review,"spatial scale (individual, local, regional, national, supranational, global)",snapshot in time vs. longitudinal study,"temporal scale (past, present, future)",qualitative/quantitative/mixed methods,location of the study (country),Dataset used (inductive),country of the institution of the first author,Notes
count,764,892,880,879,891,892,522,527,366,366,886,804,635,582,861,881,881,560,881,892,891,891,851,509,845,811,855,576,846,489,841,856,562,627,624,631,510,566,855,97
unique,3,868,857,853,875,444,67,509,306,327,870,789,353,298,846,2,2,8,1,873,5,7,286,361,6,3,735,48,7,223,12,18,26,16,22,20,119,312,107,41
top,1,Mhlanga D.,"Mhlanga, David (57218104204)",57218104204,A review of the Artificial Intelligence (AI) b...,Sustainability (Switzerland),1,14,1,171,10.1016/j.compag.2023.107836,https://www.scopus.com/inward/record.uri?eid=2...,nil,nil,[No abstract available],Article,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-85148853990,1,1,AI,n.a.,2,0,Smart farming,System optimization,1,SDGs,weak,empirical,local,snapshot,present,quantitative,n.a.,n.a.,China,"Duplicate, deleted in SDG9"
freq,758,4,4,4,2,105,74,4,17,4,2,2,200,200,4,677,857,204,881,2,722,713,281,36,538,707,17,150,663,75,455,332,131,307,365,328,86,82,94,15


### Quality of the dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [32]:
#checking duplicates in the dataframe
n_duplicates = int(df.duplicated().sum())
print(f"this dataset has {n_duplicates} duplicates")

this dataset has 0 duplicates


In [34]:
#how many missing values
df.isna().sum()

Downloaded (1/0)                                                                128
Authors                                                                           0
Author full names                                                                12
Author(s) ID                                                                     13
Title                                                                             1
Year                                                                              0
Source title                                                                      0
Volume                                                                           32
Issue                                                                           370
Art. No.                                                                        365
Page start                                                                      526
Page end                                                                    
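Absolute counts are easier to judge as a share of all rows. The sketch below uses a small toy dataframe (hypothetical values, column names borrowed from the dataset) and ranks columns by their fraction of missing values.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values) with gaps in two columns
toy_df = pd.DataFrame({
    "Title": ["A", "B", None, "D"],
    "Issue": [np.nan, np.nan, "3", np.nan],
    "Year": [2020, 2021, 2022, 2023],
})

# isna() gives booleans; their mean is the fraction of missing values
missing_share = toy_df.isna().mean().sort_values(ascending=False)

# Keep only the columns that have at least one missing value
with_gaps = missing_share[missing_share > 0]
print(with_gaps)
```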

In [37]:
#checking the datatypes
df.dtypes

Downloaded (1/0)                                                                 object
Authors                                                                          object
Author full names                                                                object
Author(s) ID                                                                     object
Title                                                                            object
Year                                                                              int64
Source title                                                                     object
Volume                                                                          float64
Issue                                                                            object
Art. No.                                                                         object
Page start                                                                       object
Page end                        

In [39]:
#which SDGs do we have?
df["SDG"].unique()

array([16, 14, 13, 15,  8, 10,  5, 11,  6, 12,  4,  7,  1,  3,  2, 18,  9])

In [40]:
df.columns

Index(['Downloaded (1/0)', 'Authors', 'Author full names', 'Author(s) ID',
       'Title', 'Year', 'Source title', 'Volume', 'Issue', 'Art. No.',
       'Page start', 'Page end', 'Page count', 'Cited by', 'DOI', 'Link',
       'Author Keywords', 'Index Keywords', 'Abstract', 'Document Type',
       'Publication Stage', 'Open Access', 'Source', 'EID', 'SDG',
       'AI (yes/no)', 'Sustainability (yes/no)',
       'Type of AI \ninductive, text snippet',
       'Algorithm(s) used\ninductive, text snippet',
       'Method (1) vs. study object (2)', 'AI as buzzword? (0/1)',
       'Core topic (only one)\nshort text snippet, inductive',
       'Role of AI\ndeductive ', 'Means (1) vs. end (2)',
       'sustainability definition', 'Sus_lvl', 'empirical/conceptual/review',
       'spatial scale (individual, local, regional, national, supranational, global)',
       'snapshot in time vs. longitudinal study',
       'temporal scale (past, present, future)',
       'qualitative/quantitative/mixed 

In [41]:
df["temporal scale (past, present, future)"].unique()

array([nan, 'present', 'future', 'present; future',
       'past; present; future', 'past, present, future',
       'present, future', 'past, present', 'n.a.', 'past',
       'present (Feb 2021 - May 2021)', 'N.A.', 'past;present',
       'past;present;future', 'present;future', 'past; present',
       'presemt', 'snapshot', 'past; future', 'present/past', 'present ',
       'Future', 'Present'], dtype=object)
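The output above shows the same category written in several ways (different capitalisation, trailing spaces, mixed separators, and two spellings of "n.a."). A minimal cleaning sketch on a toy Series with a few of those spellings follows; the chosen target format (`"; "` as separator, lowercase) is an assumption, and real typos such as `'presemt'` would still need a manual mapping.

```python
import numpy as np
import pandas as pd

# A few of the spellings from the output above, in a toy Series
raw = pd.Series([
    "present", "Present", "present ", "past;present",
    "past, present", "past; present", "N.A.", "n.a.", np.nan,
])

cleaned = (
    raw.str.strip()   # drop leading and trailing spaces
       .str.lower()   # unify capitalisation
       .str.replace(r"\s*[;,/]\s*", "; ", regex=True)  # one separator style
)

# Treat the "n.a." placeholder as a real missing value
cleaned = cleaned.where(cleaned != "n.a.")
print(cleaned.value_counts(dropna=False))
```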

In [42]:
#how many SDGs do we have?
df["SDG"].nunique()

17

In [43]:
#getting a frequency table of SDGs
df["SDG"].value_counts()

SDG
4     100
3      96
6      91
2      91
11     86
13     79
7      75
12     71
9      62
18     48
15     30
14     18
16     13
8      11
5       8
1       8
10      5
Name: count, dtype: int64

In [45]:
sdg_counts = df["SDG"].value_counts()
sdg_df = pd.DataFrame(sdg_counts) #converting the Series into a DataFrame
sdg_df = sdg_df.reset_index(drop=False)
sdg_df

Unnamed: 0,SDG,count
0,4,100
1,3,96
2,6,91
3,2,91
4,11,86
5,13,79
6,7,75
7,12,71
8,9,62
9,18,48


In [None]:
#another variable to check?


In [None]:
#saving a frequency table into a variable for "location of the study"


In [None]:
#resetting the index
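One possible way to fill in the two cells above, sketched on a toy dataframe. The country values are hypothetical; only the column name is taken from the dataset. `rename_axis` and the `name` argument are used here to get readable column names after resetting the index.

```python
import pandas as pd

# Toy dataframe (hypothetical values), column name as in the dataset
toy_df = pd.DataFrame({
    "location of the study (country)":
        ["China", "India", "China", None, "Uganda", "China"]
})

# Frequency table: value_counts() ignores missing values by default
location_counts = toy_df["location of the study (country)"].value_counts()

# Turn the Series into a dataframe with two named columns
location_df = location_counts.rename_axis("location").reset_index(name="count")
print(location_df)
```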


## Dataframe Operations
<div style="
    border: 4px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

### Modifying the index
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#making a copy of your dataset. Recommended especially if you are modifying the original df
copy_df = df.copy()

In [None]:
# setting a custom index, for example one starting from 100
# make sure the new index has exactly as many labels as the dataframe has rows
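A minimal sketch of a custom index, using a toy dataframe built from names and scores like the dictionary example earlier. `range` is sized from `len()` so that the index always fits the number of rows; using a column as index via `set_index` is shown as an alternative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "names": ["Gustavo", "Henrik", "Wanja"],
    "scores": [39, 34, 40],
})

# Work on a copy so the original keeps its default index
indexed_df = toy_df.copy()

# The new index needs exactly one label per row
indexed_df.index = range(100, 100 + len(indexed_df))

# Alternatively, an existing column can serve as the index
by_name = toy_df.set_index("names")
print(indexed_df)
print(by_name)
```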


In [None]:
# if you want to reset the index
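Resetting brings back the default 0, 1, 2, ... index. A short sketch on a toy dataframe (hypothetical values) showing both variants: discarding the old labels or keeping them as a column.

```python
import pandas as pd

# Toy dataframe (hypothetical) with a custom index
toy_df = pd.DataFrame({"scores": [39, 34, 40]}, index=[100, 101, 102])

# drop=True discards the old labels
fresh = toy_df.reset_index(drop=True)

# The default keeps the old labels in a new column called "index"
kept = toy_df.reset_index()
print(fresh)
print(kept)
```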


### Dropping rows and columns and renaming them
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#let's delete some columns, for example the first column


In [None]:
# if you want to delete some columns by their positions


In [None]:
# if you want to drop several columns
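The three cells above could be filled in along these lines. This is a sketch on a toy dataframe whose column names are borrowed from the dataset while the values are only illustrative. `drop` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Downloaded (1/0)": [1, 1, 0],
    "Authors": ["Thiebes S.", "Bartmann M.", "Halsband A."],
    "Year": [2021, 2022, 2022],
    "Notes": [None, None, None],
    "Coding complete": [None, 1.0, None],
})

# Drop the first column by name
no_first = toy_df.drop(columns="Downloaded (1/0)")

# Drop columns by position: look up their names through df.columns
by_position = toy_df.drop(columns=toy_df.columns[[0, 3]])

# Drop several columns at once by passing a list of names
several = toy_df.drop(columns=["Notes", "Coding complete"])
print(several)
```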


In [None]:
#checking duplicated rows based on a column, for example EID


In [None]:
#which rows are duplicated in the EID column


In [None]:
# dropping duplicates based on a specific column
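A sketch for the three duplicate-related cells above, on a toy dataframe with hypothetical EIDs and titles. The `subset` argument restricts the comparison to one column, and `keep=False` marks every member of a duplicated group so that all of them can be inspected.

```python
import pandas as pd

# Toy dataframe (hypothetical values): "e1" appears twice
toy_df = pd.DataFrame({
    "EID": ["e1", "e2", "e1", "e3"],
    "Title": ["Paper A", "Paper B", "Paper A (copy)", "Paper C"],
})

# How many rows repeat an EID that was already seen?
n_dup = int(toy_df.duplicated(subset="EID").sum())

# Which rows share an EID? keep=False flags all members of each group
dup_rows = toy_df[toy_df.duplicated(subset="EID", keep=False)]

# Keep only the first row per EID
deduped = toy_df.drop_duplicates(subset="EID", keep="first")
print(dup_rows)
print(deduped)
```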


In [None]:
# checking missing values from the abstract column


In [None]:
# dropping missing values from the abstract column
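A sketch for the two cells above, using a toy dataframe with hypothetical titles. Note that the `describe(include="object")` output earlier lists `[No abstract available]` as the most frequent abstract: such placeholder strings are not counted by `isna()` and are checked separately here.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Paper A", "Paper B", "Paper C"],
    "Abstract": ["Some text", None, "[No abstract available]"],
})

# Count truly missing abstracts
n_missing = int(toy_df["Abstract"].isna().sum())

# Drop the rows where the abstract is missing
with_abstract = toy_df.dropna(subset=["Abstract"])

# Placeholder text is a regular string, so it survives dropna()
is_placeholder = with_abstract["Abstract"] == "[No abstract available]"
print(n_missing, int(is_placeholder.sum()))
```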


In [None]:
#renaming columns


In [None]:
#you can also rename the column based on the position


In [None]:
#you can rename several columns at once
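The renaming cells above could look as follows. The sketch uses a toy dataframe with three column names taken from the dataset; the new names (`citations`, `journal`, `ai_role`) are hypothetical choices. Renaming by position helps with awkward names such as the one containing a line break and a trailing space.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Cited by": [167, 8],
    "Source title": ["Electronic Markets", "Sustainability (Switzerland)"],
    "Role of AI\ndeductive ": [None, None],
})

# Rename one column with an {old: new} mapping
renamed_one = toy_df.rename(columns={"Cited by": "citations"})

# Rename by position: look up the current name through df.columns
renamed_pos = toy_df.rename(columns={toy_df.columns[2]: "ai_role"})

# Rename several columns at once
renamed_many = toy_df.rename(
    columns={"Cited by": "citations", "Source title": "journal"}
)
print(list(renamed_many.columns))
```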


## Index and Slicing
<div style="
    border: 4px solid blue;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#selecting a column


In [None]:
#use double square brackets to get the result as a dataframe
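The difference between single and double square brackets, sketched on a toy dataframe (hypothetical values): single brackets return a Series, double brackets return a DataFrame and also accept several column names.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Paper A", "Paper B"],
    "Year": [2021, 2022],
})

# Single brackets: one column as a Series
as_series = toy_df["Year"]

# Double brackets: a dataframe with one column
as_frame = toy_df[["Year"]]

# Double brackets also select several columns
two_cols = toy_df[["Title", "Year"]]
print(type(as_series).__name__, type(as_frame).__name__)
```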


In [None]:
#selecting rows by index position


In [None]:
# or use double square brackets


In [None]:
#selecting the first 20 rows with all columns


In [None]:
#selecting the first 10 rows with the first five columns


In [None]:
#select the last 10 rows with the columns 3 to 8


In [None]:
#selecting rows by label and index
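A sketch covering the row and column selections listed in the cells above, on a toy dataframe of 30 rows and 10 generically named columns (hypothetical). `iloc` selects by position and excludes the end of a slice; `loc` selects by label and includes it. "Columns 3 to 8" is read here as positions 3 through 8, which is an assumption.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical): 30 rows, columns col0 ... col9
toy_df = pd.DataFrame(
    np.arange(300).reshape(30, 10),
    columns=[f"col{i}" for i in range(10)],
)

row = toy_df.iloc[0]              # one row by position, returned as a Series
row_frame = toy_df.iloc[[0]]      # double brackets keep the dataframe format
first_20 = toy_df.iloc[:20]       # first 20 rows, all columns
first_10_five = toy_df.iloc[:10, :5]   # first 10 rows, first five columns
last_10 = toy_df.iloc[-10:, 3:9]       # last 10 rows, column positions 3 to 8

# loc works with labels; the end label of a slice is included
by_label = toy_df.loc[5:7, ["col0", "col1"]]
print(by_label)
```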


In [None]:
# conditional selection, for example all articles with more than 40 citations
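Conditional selection works by passing a boolean Series as a mask. A sketch on a toy dataframe (hypothetical values); combining two conditions requires `&` or `|` with each condition in parentheses.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "Cited by": [167, 8, 41, 40],
    "Year": [2021, 2020, 2023, 2022],
})

# All articles with more than 40 citations
highly_cited = toy_df[toy_df["Cited by"] > 40]

# Two conditions: each one in parentheses, joined with &
recent_and_cited = toy_df[(toy_df["Cited by"] > 40) & (toy_df["Year"] >= 2022)]
print(highly_cited)
```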


In [None]:
# transforming SDG column to object


In [None]:
# selecting only articles from SDG 10 and 5
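A sketch for the last two cells, on a toy dataframe with hypothetical titles. Converting with `astype("object")` changes only the dtype (the values stay integers), which is enough for `describe(include="object")` to treat SDG as a category. `isin` then selects rows that match any value in a list.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "SDG": [16, 10, 5, 4],
})

# Treat the SDG number as a label rather than a quantity
toy_df["SDG"] = toy_df["SDG"].astype("object")

# Keep only the articles from SDG 10 and SDG 5
selected = toy_df[toy_df["SDG"].isin([10, 5])]
print(selected)
```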

## Exercise
Explore the following dataset: datasets/socialmedia_engagement.csv

Does it look real or machine generated?
<div style="
    border: 4px solid red;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>