## Pandas

In [1]:
import pandas as pd

## Generating Data

### What is a Data Frame?

In [98]:
# Data Frame

df = pd.DataFrame(
    {
        "Alice": ["Blonde", "Blue"],
        "Bob": ["Black", "Brown"],
        "Charlie": ["Brown", "Green"],
        "Daniel": ["Ginger", "Blue"],
        "Emily": ["Blue", "Brown"],
        "Felix": ["Gray", "Hazel"]
    },
    index = ["Hair Colour", "Eye colours"]
)

Here we pass a dictionary to pandas' `DataFrame` constructor to build the dataframe `df`.   
The dictionary keys (Alice, Bob, and so on) become the column names.   
We also label the rows through the `index` kwarg.

In [97]:
print(f"Example DataFrame: \n {df}\n")

Example DataFrame: 
               Alice    Bob Charlie  Daniel  Emily  Felix
Hair Colour  Blonde  Black   Brown  Ginger   Blue   Gray
Eye colours    Blue  Brown   Green    Blue  Brown  Hazel



To find out more about the dataframe we can:   
- Call the `shape` attribute,
- Run the `info` method, which prints a summary of it,
- Print the first 5 rows using the `head` method.

In [100]:
print(f"Size of the dataframe = {df.shape}\n")

Size of the dataframe = (2, 6)



In [101]:
print(f"Info about the dataframe:\n")
df.info()

Info about the dataframe:

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Hair Colour to Eye colours
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Alice    2 non-null      object
 1   Bob      2 non-null      object
 2   Charlie  2 non-null      object
 3   Daniel   2 non-null      object
 4   Emily    2 non-null      object
 5   Felix    2 non-null      object
dtypes: object(6)
memory usage: 112.0+ bytes


In [None]:
print(f"Head of the dataframe:\n")
df.head()

It's not uncommon to find datasets that share the same structure but cover different sets of examples, e.g.



In [93]:
df2 = pd.DataFrame({
    "Grace": ["Black", "Brown"],
    "Harry": ["Brown", "Green"],
    "Isaac": ["Blonde", "Blue"],
    "Kelly": ["Gray", "Hazel"],
    "Lola": ["Ginger", "Blue"]
},
    index = ["Hair Colour", "Eye colours"]
)

# We can extend the two dataframes using the join method
df.join(df2)

,Alice,Bob,Charlie,Daniel,Emily,Felix,Grace,Harry,Isaac,Kelly,Lola
Hair Colour,Blonde,Black,Brown,Ginger,Blue,Gray,Black,Brown,Blonde,Gray,Ginger
Eye colours,Blue,Brown,Green,Blue,Brown,Hazel,Brown,Green,Blue,Hazel,Blue
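
The `join` call above lines the two frames up by their row labels, not by row order. A minimal sketch of that behaviour, using two small made-up frames (`left` and `right` are illustrative names, not part of the workshop data):

```python
import pandas as pd

# Two toy frames with the same row labels, but listed in a different order
left = pd.DataFrame(
    {"Alice": ["Blonde", "Blue"]},
    index=["Hair Colour", "Eye colours"],
)
right = pd.DataFrame(
    {"Grace": ["Brown", "Black"]},
    index=["Eye colours", "Hair Colour"],  # reversed relative to `left`
)

# join matches rows on the index labels and keeps the order of `left`
joined = left.join(right)
print(joined)
```

Grace's hair colour still ends up in the "Hair Colour" row even though `right` lists it second. If both frames shared a column name, `join` would need the `lsuffix` or `rsuffix` kwarg to tell the two columns apart.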


In [96]:
# Or, if we want to add new rows to the dataframe, we can use the pd.concat function
# Columns that are missing from one of the inputs are filled with NaN

heights = pd.DataFrame(
    {
        "Alice": 1.75,
        "Bob": 1.80,
        "Charlie": 1.70,
        "Daniel": 1.85,
        "Emily": 1.65,
        "Felix": 1.90,
        "Grace": 1.75,
        "Harry": 1.80,
        "Isaac": 1.70,
        "Kelly": 1.85,
        "Lola": 1.65
    },
    index = ["Height"]
)

pd.concat([df, heights])

,Alice,Bob,Charlie,Daniel,Emily,Felix,Grace,Harry,Isaac,Kelly,Lola
Hair Colour,Blonde,Black,Brown,Ginger,Blue,Gray,,,,,
Eye colours,Blue,Brown,Green,Blue,Brown,Hazel,,,,,
Height,1.75,1.8,1.7,1.85,1.65,1.9,1.75,1.8,1.7,1.85,1.65
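
The empty cells in the output above come from `pd.concat` taking the union of the columns when it stacks rows. A small sketch of that, assuming cut-down versions of the frames (`traits` and `heights_small` are illustrative names):

```python
import pandas as pd

traits = pd.DataFrame(
    {"Alice": ["Blonde", "Blue"], "Bob": ["Black", "Brown"]},
    index=["Hair Colour", "Eye colours"],
)
heights_small = pd.DataFrame(
    {"Alice": 1.75, "Bob": 1.80, "Grace": 1.75},
    index=["Height"],
)

# Default behaviour: stack the rows and keep every column seen in any input.
# Grace has no hair or eye colour in `traits`, so those cells become NaN.
stacked = pd.concat([traits, heights_small])
print(stacked)

# join="inner" keeps only the columns that all inputs share
shared_only = pd.concat([traits, heights_small], join="inner")
print(shared_only)
```

Whether the union or the intersection is the better choice depends on whether the missing entries are meaningful for the analysis.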


### What is a Series?
You can think of a DataFrame as a table and a Series as a list, so instead of feeding it a dict, we feed it a list.    
We can label the entries in the same way we did for the DataFrame.   
We also have an additional kwarg, `name`, to name the series.   

In [4]:
series = pd.Series(
    ["Ruaidhri", "Ruaidhri", "Luke", "Thomas"],
    index = ["2022", "2023", "2024", "2025"],
    name = "TPSA Treasurers"
)

print(f"\nExample Series: \n{series}\n")

# To find out more information about the series:
print(f"Series name: {series.name}\n")
print(f"Series index: {series.index}\n")
print(f"Series values: {series.values}\n")
print(f"Series size: {series.size}\n")

series.info()


Example Series: 
2022    Ruaidhri
2023    Ruaidhri
2024        Luke
2025      Thomas
Name: TPSA Treasurers, dtype: object

Series name: TPSA Treasurers

Series index: Index(['2022', '2023', '2024', '2025'], dtype='object')

Series values: ['Ruaidhri' 'Ruaidhri' 'Luke' 'Thomas']

Series size: 4

<class 'pandas.core.series.Series'>
Index: 4 entries, 2022 to 2025
Series name: TPSA Treasurers
Non-Null Count  Dtype 
--------------  ----- 
4 non-null      object
dtypes: object(1)
memory usage: 64.0+ bytes
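
Because every entry carries an index label, a Series can also be built from a dict and queried like one. A short sketch using the same treasurers data (the variable name `treasurers` is just for illustration):

```python
import pandas as pd

# Building the same Series from a dict: keys become the index, values the entries
treasurers = pd.Series(
    {"2022": "Ruaidhri", "2023": "Ruaidhri", "2024": "Luke", "2025": "Thomas"},
    name="TPSA Treasurers",
)

# Look up an entry by its label, as we would with a dict key
print(treasurers["2024"])

# `in` checks the index labels (like dict keys), not the values
print("2025" in treasurers)
print("Luke" in treasurers)
```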


For most datasets, we won't be able to just create a DataFrame or Series from scratch.    
Instead, we'll need to read in data from a file.   

In [4]:
latest_articles = pd.read_csv("./workshop/latest_research_articles.csv")
latest_articles.head()


,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
0,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
1,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
2,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
3,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
4,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics


This leaves the data indexed by the default index, which is just a range of integers.

In [10]:
# We can also specify the index column
latest_articles = pd.read_csv("latest_research_articles.csv", index_col = 0) # 0 or "title" will work
latest_articles.head()

title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
Estimates of the reproduction ratio from epidemic surveillance may be biased in spatially structured populations,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
Spin Berry curvature-enhanced orbital Zeeman effect in a kagome metal,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
Room-temperature flexible manipulation of the quantum-metric structure in a topological chiral antiferromagnet,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
Irreversible entropy transport enhanced by fermionic superfluidity,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
Penning-trap measurement of the Q value of electron capture in 163Ho for the determination of the electron neutrino mass,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics
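
To try out `index_col` without the workshop file, we can read a tiny CSV from a string. This is only a sketch: `csv_text` and its three made-up rows stand in for the real dataset.

```python
import io

import pandas as pd

# A made-up three-column CSV standing in for the real file
csv_text = "title,doi,citations\nPaper A,doi-a,10\nPaper B,doi-b,0\n"

# No index_col: pandas numbers the rows 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col accepts either a column position or a column name
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pd.read_csv(io.StringIO(csv_text), index_col="title")

print(default_index.index)
print(by_name.index)
```

Both `index_col=0` and `index_col="title"` give the same frame here, because `title` is the first column.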


In [80]:
# Or we can even just rename the index itself
def index_renamer(length: int) -> dict:
    return {i: f"Article {i+1}" for i in range(0, length)}

latest_articles.reset_index(inplace= True, drop = True) 
# inplace=True means that the changes are made to the original DataFrame
# drop=True means that the old index is discarded rather than kept as a column

latest_articles.rename(index = index_renamer(len(latest_articles)), inplace = True)
latest_articles.head()

,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
Article 1,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
Article 2,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
Article 3,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
Article 4,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
Article 5,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics


### Working with dataframes

To access a column in a DataFrame, we can use the column name as an attribute. Recall that a list maps to a Series and a dict maps to a DataFrame. That mapping isn't perfect, though, since we can also treat a Series like a dictionary and look up entries by their index label.

In [38]:
print(f"Column 'DOI': \n{latest_articles.doi}\n")

# To access an entry in a dictionary, we use the key. The same applies here
print(f"Column 'DOI' again: \n{latest_articles['doi']}\n")

# We can access an entry in a Series by its index label (with the default integer index, the 0 here is a label, not a position)
print(f"First entry in 'DOI': {latest_articles.doi[0]}\n")

Column 'DOI': 
0       https://doi.org/10.1038/s41567-024-02471-7
1       https://doi.org/10.1038/s41567-024-02487-z
2       https://doi.org/10.1038/s41567-024-02476-2
3       https://doi.org/10.1038/s41567-024-02483-3
4       https://doi.org/10.1038/s41567-024-02461-9
                           ...                    
4179    https://doi.org/10.1038/s41598-024-58273-7
4180    https://doi.org/10.1038/s41598-024-58735-y
4181    https://doi.org/10.1038/s41467-023-44627-8
4182    https://doi.org/10.1038/s41598-024-59094-4
4183    https://doi.org/10.1038/s41467-023-44620-1
Name: doi, Length: 4184, dtype: object

Column 'DOI' again: 
0       https://doi.org/10.1038/s41567-024-02471-7
1       https://doi.org/10.1038/s41567-024-02487-z
2       https://doi.org/10.1038/s41567-024-02476-2
3       https://doi.org/10.1038/s41567-024-02483-3
4       https://doi.org/10.1038/s41567-024-02461-9
                           ...                    
4179    https://doi.org/10.1038/s41598-024-58273-7
4180  

To access rows, or more generally to avoid the ambiguity between labels and positions, we use pandas' built-in `iloc` indexer.

In [15]:
# iloc takes an integer position and returns the row at that position
print(f"First row: \n{latest_articles.iloc[0]}\n")

# Or to find the first entry in a series
print(f"First entry in 'DOI': {latest_articles.doi.iloc[0]}\n")

First row: 
abstract              Accurate estimates of the reproduction ratio a...
doi                          https://doi.org/10.1038/s41567-024-02471-7
citations                                                             0
accesses                                                              0
online_attention                                                      0
published_datetime                                        25 April 2024
tweeters                                                              5
blogs                                                                 0
facebook_pages                                                        0
news_outlets                                                          0
redditors                                                             0
video_uploaders                                                       0
wikipedia_page                                                        0
mendeley                                            

This indexing is the same as we would use for a list, so we can find the first 5 rows using slicing.

In [16]:
print(f"First 5 rows: \n{latest_articles.iloc[:5]}\n")

First 5 rows: 
                                                                                             abstract  \
title                                                                                                   
Estimates of the reproduction ratio from epidem...  Accurate estimates of the reproduction ratio a...   
Spin Berry curvature-enhanced orbital Zeeman ef...  Berry phases and the related concept of Berry ...   
Room-temperature flexible manipulation of the q...  The quantum metric and Berry curvature are two...   
Irreversible entropy transport enhanced by ferm...  The nature of particle and entropy flow betwee...   
Penning-trap measurement of the Q value of elec...  The investigation of the absolute scale of the...   

                                                                                           doi  \
title                                                                                            
Estimates of the reproduction ratio from epidem...  h

We can also pass `iloc` a row position and a column position to access a single entry.

In [18]:
print(f"First entry in 'DOI': {latest_articles.iloc[0, 1]}\n")

First entry in 'DOI': https://doi.org/10.1038/s41567-024-02471-7



Integer positions can be difficult to keep track of, so we can also use the `loc` indexer, which takes the row's index label and the column name.   


In [47]:
# Undoing the indexing
latest_articles = latest_articles.reset_index()

print(f"First entry in 'DOI' again: {latest_articles.loc[0, 'doi']}\n")

# The 0 here is the index of the row, and the 'doi' is the column name

# So if we change the index of the DataFrame, the row labels change too
latest_articles = latest_articles.set_index("doi")

# Now that the rows are labelled by DOI, we can look up the title of the first article by its DOI
title = latest_articles.loc["https://doi.org/10.1038/s41567-024-02471-7", "title"]
print(f"Title of the first article: {title}\n")

First entry in 'DOI' again: https://doi.org/10.1038/s41567-024-02471-7

Title of the first article: Estimates of the reproduction ratio from epidemic surveillance may be biased in spatially structured populations
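
To make the difference between `loc` and `iloc` concrete, here is a small sketch on a made-up frame (`papers` and its `id-*` labels are hypothetical, not taken from the dataset):

```python
import pandas as pd

papers = pd.DataFrame(
    {"title": ["Paper A", "Paper B", "Paper C"], "citations": [10, 0, 3]},
    index=["id-1", "id-2", "id-3"],
)

# loc works with labels: row label "id-2", column name "title"
by_label = papers.loc["id-2", "title"]

# iloc works with positions: second row, first column
by_position = papers.iloc[1, 0]

print(by_label, by_position)  # both are "Paper B"

# Slices differ too: loc includes the end label, iloc excludes the end position
label_slice = papers.loc["id-1":"id-3"]  # 3 rows
position_slice = papers.iloc[0:2]        # 2 rows
print(len(label_slice), len(position_slice))
```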



In [48]:
# Note that we can also set the index in place
print(f"Info before reindexing:\n")
latest_articles.info()
latest_articles.set_index("title", inplace = True)


Info before reindexing:

<class 'pandas.core.frame.DataFrame'>
Index: 4184 entries, https://doi.org/10.1038/s41567-024-02471-7 to https://doi.org/10.1038/s41467-023-44620-1
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   title               4184 non-null   object
 1   abstract            4169 non-null   object
 2   citations           4184 non-null   int64 
 3   accesses            4184 non-null   object
 4   online_attention    4184 non-null   int64 
 5   published_datetime  4184 non-null   object
 6   tweeters            4184 non-null   int64 
 7   blogs               4184 non-null   int64 
 8   facebook_pages      4184 non-null   int64 
 9   news_outlets        4184 non-null   int64 
 10  redditors           4184 non-null   int64 
 11  video_uploaders     4184 non-null   int64 
 12  wikipedia_page      4184 non-null   int64 
 13  mendeley            4184 non-null   int64 
 14  Topic               4

In [49]:
print(f"Info after reindexing: \n")
latest_articles.info()

Info after reindexing: 

<class 'pandas.core.frame.DataFrame'>
Index: 4184 entries, Estimates of the reproduction ratio from epidemic surveillance may be biased in spatially structured populations to Purely self-rectifying memristor-based passive crossbar array for artificial neural network accelerators
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   abstract            4169 non-null   object
 1   citations           4184 non-null   int64 
 2   accesses            4184 non-null   object
 3   online_attention    4184 non-null   int64 
 4   published_datetime  4184 non-null   object
 5   tweeters            4184 non-null   int64 
 6   blogs               4184 non-null   int64 
 7   facebook_pages      4184 non-null   int64 
 8   news_outlets        4184 non-null   int64 
 9   redditors           4184 non-null   int64 
 10  video_uploaders     4184 non-null   int64 
 11  wikipedia_page      4184 non-null

In [51]:
# Reload the data to restore the default index
latest_articles = pd.read_csv("latest_research_articles.csv")

# We can check the bool value of a condition for each entry in a column
# This is useful for filtering data
latest_articles.Topic == "Physics"

0        True
1        True
2        True
3        True
4        True
        ...  
4179    False
4180    False
4181    False
4182    False
4183    False
Name: Topic, Length: 4184, dtype: bool

In [52]:
# Using this, we can filter the data
latest_articles.loc[latest_articles.Topic == "Physics"]

# You can also combine conditions using the & (and) and | (or) operators, wrapping each condition in parentheses

,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
0,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
1,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
2,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
3,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
4,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,Signature of effective mass in crackling-noise...,Crackling noise is a common feature in many dy...,https://doi.org/10.1038/nphys101,107,2233,0,29 September 2005,0,0,0,0,0,0,0,0,Physics
2926,Criticality in correlated quantum matter,At quantum critical points (QCPs) quantum fluc...,https://doi.org/10.1038/nphys105,98,2904,0,29 September 2005,0,0,0,0,0,0,0,0,Physics
2927,Spatial imaging of the spin Hall effect and cu...,Spin–orbit coupling in semiconductors relates ...,https://doi.org/10.1038/nphys009,405,8005,1,29 September 2005,3,0,0,0,0,0,0,279,Physics
2928,The role of the interlayer state in the electr...,"Although not an intrinsic superconductor, grap...",https://doi.org/10.1038/nphys119,247,7730,12,29 September 2005,1,1,0,0,0,0,4,236,Physics
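
Combining conditions could look something like the sketch below. It uses a small made-up frame (`articles`) with the same column names as the dataset; the numbers are invented for illustration.

```python
import pandas as pd

articles = pd.DataFrame(
    {
        "Topic": ["Physics", "Physics", "Engineering", "Pure-Mathematics"],
        "citations": [0, 120, 45, 3],
    }
)

# Each condition needs its own parentheses, because & and | bind more
# tightly than comparisons such as == and >
cited_physics = articles.loc[(articles.Topic == "Physics") & (articles.citations > 100)]

engineering_or_uncited = articles.loc[
    (articles.Topic == "Engineering") | (articles.citations == 0)
]

# ~ negates a condition
not_physics = articles.loc[~(articles.Topic == "Physics")]

print(cited_physics)
print(engineering_or_uncited)
print(not_physics)
```

Python's `and` and `or` keywords don't work here, since they try to reduce a whole Series to a single truth value.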


An important use of these filtering techniques is to get rid of missing data.   
Missing values usually appear as `NaN` in the DataFrame.   
The `isnull` method returns `True` for `NaN` values, and `notnull` returns `True` for everything else.   
E.g. `latest_articles[latest_articles.citations.notnull()]` keeps only the entries whose citation count is not missing.
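
As a rough sketch of handling missing values, assume a toy frame in which one abstract is missing (`articles` and its contents are made up for illustration):

```python
import pandas as pd

articles = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C"],
        "abstract": ["Some text", None, "More text"],
    }
)

# True where the abstract is missing
missing = articles.abstract.isnull()
print(missing.tolist())  # [False, True, False]

# Keep only rows that have an abstract, via a boolean filter ...
kept = articles[articles.abstract.notnull()]

# ... or via dropna, restricted to the column we care about
also_kept = articles.dropna(subset=["abstract"])

# Alternatively, fill the gaps instead of dropping the rows
filled = articles.abstract.fillna("No abstract available")
print(filled.tolist())
```

Dropping and filling are both reasonable; which one to use depends on whether the rows with gaps are still useful for the question at hand.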

In [55]:
# Alternatively, we can use the pandas method isin to match any of several values
latest_articles.loc[latest_articles.Topic.isin(["Physics", "Engineering"])]

,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
0,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
1,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
2,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
3,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
4,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,Short-term creep approach to redefining the ro...,The creep response of the 17-4PH martensitic a...,https://doi.org/10.1038/s41598-024-58273-7,0,134,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering
4180,Jamming precoding in AF relay-aided PLC system...,Enhancing information security has become incr...,https://doi.org/10.1038/s41598-024-58735-y,0,136,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering
4181,Frequency-hopping wave engineering with metasu...,Wave phenomena can be artificially engineered ...,https://doi.org/10.1038/s41467-023-44627-8,1,3634,10,03 January 2024,1,0,0,1,0,0,0,5,Engineering
4182,The effect of floating spline parameter on the...,The load sharing performance of encased differ...,https://doi.org/10.1038/s41598-024-59094-4,0,148,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering


In [None]:
# To add in new columns, we can use the same syntax as we would for a dictionary

def get_based(df: pd.DataFrame) -> list:
    based = []
    for entry in df.Topic:
        if entry == "Physics":
            based.append(True)
        else:
            based.append(False)
    return based

latest_articles["Based"] = get_based(latest_articles)
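
# The loop above works, but the same column can usually be built without a loop.
# A sketch on a made-up frame (`articles` is illustrative, not the workshop data):

```python
import pandas as pd

articles = pd.DataFrame({"Topic": ["Physics", "Engineering", "Physics"]})

# The comparison already returns a boolean Series with one entry per row,
# so we can assign it straight to the new column
articles["Based"] = articles.Topic == "Physics"
print(articles)
```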

We can find out more about the data using different methods. Specifically we can: 
- Get a summary of all the data
- Get a summary of the title column
- Get the number of entries (and unique entries) of each value in a certain column
- Get the `max`, `min`, and `mean` of a given numerical column

In [6]:
latest_articles.describe() # Gives a summary of the data


,citations,online_attention,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,106.360421,46.59608,19.141013,0.797323,0.444312,4.584369,0.086042,0.049713,0.30043,104.605641
std,202.597338,129.670858,61.791609,2.084982,2.229109,15.098942,0.445184,0.350424,1.386331,148.328279
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0
50%,45.0,12.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,57.0
75%,134.0,49.0,17.0,1.0,0.0,4.0,0.0,0.0,0.0,143.0
max,5098.0,3774.0,2034.0,36.0,79.0,430.0,8.0,9.0,45.0,2710.0


In [7]:
latest_articles.title.describe() # Gives a summary of the title column

count                                                 4184
unique                                                4170
top       Interface frictional anisotropy of dilative sand
freq                                                     2
Name: title, dtype: object

In [8]:
latest_articles.Topic.value_counts() # Gives the number of entries for each unique value in the Topic column

Topic
Physics                  2930
Engineering              1000
Computational Science     197
Pure-Mathematics           57
Name: count, dtype: int64

In [9]:
latest_articles.citations.mean() # Gives the mean of the citations column

106.3604206500956

In [21]:
latest_articles.citations.min() # Gives the minimum of the citations column

0

In [22]:
latest_articles.citations.max() # Gives the maximum of the citations column

5098

Often, it's important to transform data. For instance, when changing units or normalising data.   
We can do this in one of two ways:

In [28]:
# 1.
# Change the values in a column using the map method
# Say we want to change the values in the Topic column to lowercase
latest_articles.Topic.map(lambda r: r.lower())

0           physics
1           physics
2           physics
3           physics
4           physics
           ...     
4179    engineering
4180    engineering
4181    engineering
4182    engineering
4183    engineering
Name: Topic, Length: 4184, dtype: object

Note that this doesn't change the values in the DataFrame.   
To do that, we can assign the result back to the column.
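
A minimal sketch of that assignment, on a made-up two-row frame (`articles` is an illustrative name):

```python
import pandas as pd

articles = pd.DataFrame({"Topic": ["Physics", "Engineering"]})

# map returns a new Series and leaves the DataFrame untouched
lowered = articles.Topic.map(lambda t: t.lower())
print(articles.Topic.tolist())  # still ['Physics', 'Engineering']

# Assigning the result back is what actually updates the column
articles["Topic"] = lowered
print(articles.Topic.tolist())  # ['physics', 'engineering']

# For string columns, the .str accessor offers the same operation without a lambda
upper_again = articles.Topic.str.upper()
print(upper_again.tolist())  # ['PHYSICS', 'ENGINEERING']
```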

In [20]:
# 2.
# Change the values in a column using the apply method
# Say we want to change the values in the Topic column to uppercase

latest_articles.apply(lambda r: r.Topic.upper(), axis = "columns")  # axis = "columns" calls the function once per row; axis = "index" would call it once per column

0           PHYSICS
1           PHYSICS
2           PHYSICS
3           PHYSICS
4           PHYSICS
           ...     
4179    ENGINEERING
4180    ENGINEERING
4181    ENGINEERING
4182    ENGINEERING
4183    ENGINEERING
Length: 4184, dtype: object
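
The `axis` argument is easy to mix up, so here is a small sketch on a made-up numeric frame (`scores` is hypothetical) showing both directions:

```python
import pandas as pd

scores = pd.DataFrame({"citations": [10, 0, 3], "tweeters": [5, 1, 14]})

# axis="columns": the function receives one row at a time,
# so it can combine values from several columns
row_totals = scores.apply(lambda row: row.citations + row.tweeters, axis="columns")
print(row_totals.tolist())  # [15, 1, 17]

# axis="index": the function receives one column at a time
column_maxima = scores.apply(lambda col: col.max(), axis="index")
print(column_maxima)
```

As a rule of thumb, `map` is enough when only one column is involved, and `apply` with `axis="columns"` is useful when the result depends on several columns of the same row.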

Sometimes, we need information about a particular group of rows, such as all the articles on one topic. For that, we use `groupby`.

In [42]:
latest_articles.groupby("Topic").citations.max()

Topic
Computational Science     297
Engineering               794
Physics                  5098
Pure-Mathematics          162
Name: citations, dtype: int64

In [63]:
latest_articles.groupby(["Topic", "published_datetime"]).citations.agg(["max"]) # This tells us the maximum number of citations in each Topic on each date



Topic,published_datetime,max
Computational Science,01 February 2021,32
Computational Science,01 June 2023,9
Computational Science,01 May 2023,10
Computational Science,02 June 2022,18
Computational Science,03 October 2022,7
...,...,...
Pure-Mathematics,28 July 2023,2
Pure-Mathematics,28 June 2023,4
Pure-Mathematics,29 July 2016,27
Pure-Mathematics,29 September 2017,23
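
`agg` takes a list, so several summaries can be computed per group in one go. A sketch on a made-up frame (`articles` and its numbers are invented for illustration):

```python
import pandas as pd

articles = pd.DataFrame(
    {
        "Topic": ["Physics", "Physics", "Engineering", "Engineering"],
        "citations": [100, 20, 5, 15],
    }
)

# One row per Topic, one column per aggregation named in the list
summary = articles.groupby("Topic").citations.agg(["count", "mean", "max"])
print(summary)
```

Passing the aggregation names as strings, rather than Python built-ins like `max`, lets pandas use its own implementations.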


Now that we know how to group data - it's equally important to be able to sort it.   
We can sort the data by a column using the `sort_values()` method.

In [67]:
# Here, we sort the data in descending order by the citations column, breaking ties by the Topic column (also descending)
latest_articles.sort_values(by = ["citations", "Topic"], ascending= False) 

,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
2474,"Topological insulators in Bi2Se3, Bi2Te3 and S...",Topological insulators are new states of quant...,https://doi.org/10.1038/nphys1270,5098,85k,27,10 May 2009,1,1,0,1,0,0,4,2710,Physics
2838,Chiral tunnelling and the Klein paradox in gra...,The so-called Klein paradox—unimpeded penetrat...,https://doi.org/10.1038/nphys384,3227,31k,47,20 August 2006,3,4,1,1,0,0,6,1939,Physics
2475,Observation of a large-gap topological-insulat...,Topological insulators are exotic states of ma...,https://doi.org/10.1038/nphys1274,3084,48k,7,10 May 2009,1,0,0,0,0,0,1,1479,Physics
2364,Observation of parity–time symmetry in optics,A photonic system that shows behaviour similar...,https://doi.org/10.1038/nphys1515,2769,42k,29,24 January 2010,0,1,0,3,0,0,2,883,Physics
2275,Identification of influential spreaders in com...,"Spreading of information, ideas or diseases ca...",https://doi.org/10.1038/nphys1746,1978,29k,17,29 August 2010,12,0,2,0,0,0,7,1109,Physics
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2954,Predictive analyses of regulatory sequences wi...,Deep learning has become a popular tool to stu...,https://doi.org/10.1038/s43588-023-00544-w,0,11k,79,16 November 2023,95,1,0,4,0,0,0,16,Computational Science
2965,Denoising sparse microbial signals from single...,Existing genomic sequencing data can be used t...,https://doi.org/10.1038/s43588-023-00507-1,0,978,51,18 September 2023,9,1,0,7,0,0,0,18,Computational Science
2966,Cellular harmonics for the morphology-invarian...,The spatiotemporal organization of membrane-as...,https://doi.org/10.1038/s43588-023-00512-4,0,968,10,14 September 2023,17,0,0,0,0,0,0,9,Computational Science
2971,Revelation of hidden 2D atmospheric turbulence...,Turbulence exists widely in the natural atmosp...,https://doi.org/10.1038/s43588-023-00498-z,0,1022,11,10 August 2023,3,0,0,1,0,0,0,3,Computational Science
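
`ascending` can also be a list with one entry per sort column, in case the columns should be sorted in different directions. A sketch on a made-up frame (`articles` is illustrative):

```python
import pandas as pd

articles = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "Topic": ["Physics", "Engineering", "Physics", "Engineering"],
        "citations": [5, 5, 40, 0],
    }
)

# Most-cited first; among articles with equal citations, Topic in A-Z order
ranked = articles.sort_values(by=["citations", "Topic"], ascending=[False, True])
print(ranked)
```

Like most pandas methods, `sort_values` returns a new DataFrame, so `articles` itself keeps its original order unless the result is assigned back.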
