Pandas is a Python library for data analysis whose name derives from the terms 'panel data' and 'Python data analysis'. It is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools, built on NumPy.

Pandas can import data from a variety of sources such as CSV, JSON, SQL, and Microsoft Excel, and supports operations on that data such as merging, reshaping, and selection, as well as data cleaning and data processing.
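
As a rough illustration of the import, merge and selection operations mentioned above, the following sketch reads two tiny made-up tables from in-memory text rather than from real files. The names csv_text, json_text, people and scores are hypothetical and exist only for this example.

```python
import io
import pandas as pd

# Hypothetical in-memory text standing in for a CSV file and a JSON file.
csv_text = "id,name\n1,Ann\n2,Bob\n"
json_text = '[{"id": 1, "score": 90}, {"id": 2, "score": 75}]'

people = pd.read_csv(io.StringIO(csv_text))
scores = pd.read_json(io.StringIO(json_text))

# Merge the two tables on their shared "id" column.
merged = people.merge(scores, on="id")

# Select only the rows whose score is above 80.
high = merged[merged["score"] > 80]
print(high)
```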

Pandas is widely used in various fields of data analysis such as academia, finance and statistics.

Install Pandas using pip:
!pip install pandas

Example 01. View the pandas version.

In [1]:
#Examples.01
import pandas
pandas.__version__  # View pandas version

'1.2.2'

By convention, pandas is usually imported under the alias 'pd'.

In [2]:
import pandas as pd
pd.__version__ 

'1.2.2'

Example 02. A simple example of Pandas.

In [3]:
#Examples.02
import pandas as pd

TAdataset = {
  'Teaching Assistant': ["Colin Yuan", "Yifan Wang"],
  'E-mail': ["Y.Wang6@lboro.ac.uk", "Y.Wang6@lboro.ac.uk"]
}

TAvar = pd.DataFrame(TAdataset)

print(TAvar)

  Teaching Assistant               E-mail
0         Colin Yuan  Y.Wang6@lboro.ac.uk
1         Yifan Wang  Y.Wang6@lboro.ac.uk


Task 01. Follow the tutorial to learn functions that you can use to load data and gain some insight into it:

Step 1. Import the necessary libraries

In [4]:
import pandas as pd

Step 2. Import the dataset using the pandas read_csv function.

In [16]:
# pd.read_csv() reads a delimited text file into a pandas DataFrame
users = pd.read_csv('https://raw.githubusercontent.com/Yandong024/MachineLearning/main/code/Tutorial/D1/data.txt', sep='|', index_col='user_id')
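
The sep and index_col arguments used above can be tried offline on a small sample. The sketch below uses two made-up rows in the same pipe-delimited layout (the raw string is hypothetical, not the tutorial file) and also shows the optional dtype argument, which may be useful when a column of digits such as a zip code should stay as text.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample in the same layout as the tutorial file.
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|36|M|administrator|05201\n"
)

# Without a dtype hint, an all-digit zip_code column is parsed as integers,
# so the leading zero is lost.
as_int = pd.read_csv(io.StringIO(raw), sep="|", index_col="user_id")

# Asking for strings keeps each zip code exactly as written.
as_str = pd.read_csv(
    io.StringIO(raw), sep="|", index_col="user_id", dtype={"zip_code": str}
)

print(as_int.loc[2, "zip_code"])
print(as_str.loc[2, "zip_code"])
```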

Step 3. Print the first 10 entries

In [5]:
users.head(10)

         age gender     occupation zip_code
user_id
1         24      M     technician    85711
2         53      F          other    94043
3         23      M         writer    32067
4         24      M     technician    43537
5         33      F          other    15213
6         42      M      executive    98101
7         57      M  administrator    91344
8         36      M  administrator    05201
9         29      M        student    01002
10        53      M         lawyer    90703



Step 4. Print the last 10 entries

In [6]:
users.tail(10)

         age gender     occupation zip_code
user_id
934       61      M       engineer    22902
935       42      M         doctor    66221
936       24      M          other    32789
937       48      M       educator    98072
938       38      F     technician    55038
939       26      F        student    33319
940       32      M  administrator    02215
941       20      M        student    97229
942       48      F      librarian    78209
943       22      M        student    77841
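
To see how the row count passed to head() and tail() behaves, here is a small sketch on a made-up 20-row DataFrame (the name df and its single value column are hypothetical).

```python
import pandas as pd

# Hypothetical frame with 20 rows holding the values 1..20.
df = pd.DataFrame({"value": range(1, 21)})

first_three = df.head(3)   # the first 3 rows
last_three = df.tail(3)    # the last 3 rows
default_head = df.head()   # with no argument, 5 rows are returned

print(first_three["value"].tolist())
print(last_three["value"].tolist())
print(len(default_head))
```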


Step 5. What is the number of observations in the dataset?


In [7]:
users.shape[0]

943

Step 6. What is the number of attributes in the dataset?

In [8]:
users.shape[1]

4
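
Steps 5 and 6 both read from the same shape tuple of (rows, columns). The sketch below, on a made-up DataFrame named df, unpacks the tuple in one go and shows two equivalent ways of getting the same numbers.

```python
import pandas as pd

# Hypothetical frame with 3 observations and 2 attributes.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

print(n_rows, n_cols)
print(len(df))          # same as the number of rows
print(len(df.columns))  # same as the number of columns
```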

Step 7. Print the name of all the attributes.

In [9]:
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

Step 8. What is the data type of each attribute?

In [10]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
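
Once the data types are known, columns can be picked by type. This is a minimal sketch using select_dtypes on a made-up two-column DataFrame; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({"age": [24, 53], "gender": ["M", "F"]})

# Keep only the numeric columns.
numeric = df.select_dtypes(include="number")

# Keep everything that is not numeric.
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```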

Step 9. Print only a single attribute column.

In [11]:
# Example: the gender attribute
users.gender

user_id
1      M
2      F
3      M
4      M
5      F
      ..
939    F
940    M
941    M
942    F
943    M
Name: gender, Length: 943, dtype: object
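
Attribute access such as users.gender is a shortcut for bracket access. The sketch below, on a made-up DataFrame, compares the two and shows that double brackets return a DataFrame rather than a Series. The column name "zip code" (with a space) is a hypothetical example of a name that only works with brackets.

```python
import pandas as pd

# Hypothetical frame; "zip code" contains a space on purpose.
df = pd.DataFrame({"gender": ["M", "F"], "zip code": ["85711", "94043"]})

as_series = df["gender"]    # single brackets give a Series
as_frame = df[["gender"]]   # double brackets give a one-column DataFrame
with_space = df["zip code"] # not reachable with attribute access

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(df.gender.equals(as_series))
```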

Step 10. How many different occupations are in this dataset?

In [12]:
users.occupation.nunique()

21
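
Note that nunique() ignores missing values by default. A small sketch on a made-up Series (the name s and its values are hypothetical) shows the difference between nunique() and unique().

```python
import pandas as pd

# Hypothetical column with one repeated value and one missing value.
s = pd.Series(["student", "other", "student", None])

print(s.nunique())              # missing values are not counted
print(s.nunique(dropna=False))  # count the missing value as well
print(len(s.unique()))          # unique() keeps the missing value
```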

Step 11. What is the most frequent occupation?

In [13]:
users.occupation.value_counts().head(1).index[0]

'student'
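
The same answer can be reached in other ways. As a sketch on a made-up Series, value_counts() sorts by frequency in descending order, so idxmax() on the counts or mode() on the original column should give the same label when there is a single most frequent value.

```python
import pandas as pd

# Hypothetical occupations with one clear most frequent value.
s = pd.Series(["student", "other", "student", "writer"])

counts = s.value_counts()      # sorted from most to least frequent

via_head = counts.head(1).index[0]
via_idxmax = counts.idxmax()   # label with the highest count
via_mode = s.mode()[0]         # first of the most frequent values

print(via_head, via_idxmax, via_mode)
```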

Step 12. Summarize the DataFrame.

In [14]:
users.describe()

              age
count  943.000000
mean    34.051962
std     12.192740
min      7.000000
25%     25.000000
50%     31.000000
75%     43.000000
max     73.000000


Step 13. Summarize all the columns

In [16]:
users.describe(include = "all")

               age gender occupation zip_code
count   943.000000    943        943      943
unique         NaN      2         21      795
top            NaN      M    student    55414
freq           NaN    670        196        9
mean     34.051962    NaN        NaN      NaN
std      12.192740    NaN        NaN      NaN
min       7.000000    NaN        NaN      NaN
25%      25.000000    NaN        NaN      NaN
50%      31.000000    NaN        NaN      NaN
75%      43.000000    NaN        NaN      NaN
max      73.000000    NaN        NaN      NaN
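
The difference between Steps 12 and 13 can be checked on a tiny made-up DataFrame: by default describe() keeps only numeric columns, while include="all" adds the remaining columns and fills the statistics that do not apply with NaN.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({"age": [24, 53, 23], "gender": ["M", "F", "M"]})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # every column

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc["top", "gender"])   # most frequent value
```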


Step 14. Summarize only the occupation column.

In [17]:
users.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

Step 15. What is the mean age of users?

In [18]:
round(users.age.mean())

34
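
Calling round() with no second argument rounds to the nearest whole number. The sketch below, on made-up ages, also shows rounding to two decimal places and, as one possible next step, the mean age per gender using groupby.

```python
import pandas as pd

# Hypothetical ages and genders for three users.
df = pd.DataFrame({"age": [20, 30, 41], "gender": ["M", "F", "M"]})

mean_age = df["age"].mean()   # 91 / 3

print(round(mean_age))        # nearest whole number
print(round(mean_age, 2))     # two decimal places

# Mean age within each gender group.
by_gender = df.groupby("gender")["age"].mean()
print(by_gender)
```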