Importing a dataset using the pandas .read_csv() function:

In [22]:
import pandas as pd

# Read the CSV file into a DataFrame ('filename.csv' is a placeholder path)
dataset_name = pd.read_csv('filename.csv')
print(dataset_name)

  Player Name  Games Played  Points Per Game  Assists Per Game  \
0    Player 1            60             19.6               9.0   
1    Player 2            53             28.8               8.7   
2    Player 3            67             27.0               9.3   
3    Player 4            79             27.6               1.6   
4    Player 5            58             18.0               9.4   
5    Player 6            81             28.3               1.7   
6    Player 7            54             24.7              10.0   
7    Player 8            67             11.6               5.6   
8    Player 9            78             10.3               6.7   
9   Player 10            81              5.5               1.0   

   Rebounds Per Game  Steals Per Game  Field Goal Percentage  
0                2.5              1.9                   58.6  
1               11.0              1.0                   46.0  
2                3.3              1.1                   42.0  
3               11.9 

To preview the first few rows of the dataset, we can use the pandas method .head(), which displays the first five rows by default:

In [23]:
dataset_name.head()

  Player Name  Games Played  Points Per Game  Assists Per Game  Rebounds Per Game  Steals Per Game  Field Goal Percentage
0    Player 1            60             19.6               9.0                2.5              1.9                   58.6
1    Player 2            53             28.8               8.7               11.0              1.0                   46.0
2    Player 3            67             27.0               9.3                3.3              1.1                   42.0
3    Player 4            79             27.6               1.6               11.9              1.8                   57.5
4    Player 5            58             18.0               9.4               11.5              2.5                   44.6

When we import a dataset, pandas infers a data type for each column. These inferences aren't always correct, so it's good practice to check that they match what you expect. To check the data types of our columns, we can use the .dtypes attribute:

In [24]:
dataset_name.dtypes

Player Name               object
Games Played               int64
Points Per Game          float64
Assists Per Game         float64
Rebounds Per Game        float64
Steals Per Game          float64
Field Goal Percentage    float64
dtype: object
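
As an illustration of an inference that does not match expectations, here is a minimal sketch. It assumes a small made-up CSV (the `csv_text` string and the "unknown" entry are hypothetical, not part of the dataset above) in which one stray text entry causes a numeric column to be read as text. `pd.to_numeric` with `errors="coerce"` is one way to repair such a column.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file. The "unknown" entry
# prevents pandas from reading 'Games Played' as a numeric column.
csv_text = """Player Name,Games Played
Player 1,60
Player 2,unknown
Player 3,67
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)  # 'Games Played' is read as text, not as integers

# errors="coerce" replaces entries that cannot be parsed with NaN.
sample["Games Played"] = pd.to_numeric(sample["Games Played"], errors="coerce")
print(sample.dtypes)  # 'Games Played' is now float64, since NaN needs a float column
```

The unparseable entry becomes a missing value rather than raising an error, so it is worth counting the missing values afterwards (for example with `.isna().sum()`).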

We can access a specific column of a DataFrame using square bracket notation. To look at the 'Games Played' column of our dataset, we could write:

In [25]:
dataset_name['Games Played']

0    60
1    53
2    67
3    79
4    58
5    81
6    54
7    67
8    78
9    81
Name: Games Played, dtype: int64

To extract more than one column at a time, we pass a list of column names to the square brackets:

In [26]:
dataset_name[['Games Played', 'Player Name']]

   Games Played Player Name
0            60    Player 1
1            53    Player 2
2            67    Player 3
3            79    Player 4
4            58    Player 5
5            81    Player 6
6            54    Player 7
7            67    Player 8
8            78    Player 9
9            81   Player 10
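
The two selection styles above return different kinds of objects, which the following sketch makes explicit. The `players` DataFrame is a small hypothetical stand-in that reuses two column names from the dataset.

```python
import pandas as pd

# A small hypothetical DataFrame using two of the column names from the text.
players = pd.DataFrame({
    "Player Name": ["Player 1", "Player 2", "Player 3"],
    "Games Played": [60, 53, 67],
})

one_column = players["Games Played"]   # a single column name returns a Series
as_frame = players[["Games Played"]]   # a list of names returns a DataFrame

print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame
print(as_frame.shape)             # (3, 1)
```

A list of names always produces a DataFrame, even when the list holds only one name.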

Pandas provides a built-in method .value_counts() that lists the unique values in a column and counts how often each one occurs. By default, the results are sorted by count, from largest to smallest. We can change the method's parameters to return the data in a different way:

- .value_counts(normalize=True) returns proportions (fractions that sum to 1) instead of raw counts
- .value_counts(ascending=True) sorts the counts from smallest to largest

In [27]:
dataset_name['Games Played'].value_counts()

67    2
81    2
60    1
53    1
79    1
58    1
54    1
78    1
Name: Games Played, dtype: int64
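
The two parameters can be tried on the same values. This sketch rebuilds the 'Games Played' column as a standalone Series (the variable names are illustrative) so that it runs without the CSV file.

```python
import pandas as pd

# The ten 'Games Played' values shown in the output above.
games = pd.Series([60, 53, 67, 79, 58, 81, 54, 67, 78, 81], name="Games Played")

# normalize=True divides each count by the total number of entries.
proportions = games.value_counts(normalize=True)
print(proportions.loc[67])  # 0.2, because 67 appears in 2 of 10 rows

# ascending=True puts the least frequent values first.
rarest_first = games.value_counts(ascending=True)
print(rarest_first.iloc[0])   # 1
print(rarest_first.iloc[-1])  # 2
```

Multiplying the normalized result by 100 gives percentages.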

While .value_counts() can be used on numeric columns (especially when they are categorical), questions about numeric data are often more statistical. For example, we might want to know the largest and smallest values in a numeric column.

The Series method .describe() calculates statistical information about the numbers in a numeric column.

In [28]:
dataset_name['Games Played'].describe()

count    10.000000
mean     67.800000
std      11.282238
min      53.000000
25%      58.500000
50%      67.000000
75%      78.750000
max      81.000000
Name: Games Played, dtype: float64

The information we receive includes:

- count: the number of non-missing values in the column
- mean: the average of the numbers in the column
- std: the standard deviation of the numbers in the column
- min: the minimum of the numbers in the column
- 25%, 50%, 75%: the 25th, 50th, and 75th percentiles of the numbers in the column
- max: the maximum of the numbers in the column

Note that in each case, the statistic is based on the numbers in the column. Most real-world datasets are missing data in some columns. These statistics ignore those missing entries, since we can’t know what the missing values should have been.
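
A short sketch of how missing entries are handled, using a made-up four-value Series with one missing entry (`np.nan`):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
values = pd.Series([60, 53, np.nan, 79])

summary = values.describe()
print(len(values))       # 4 entries in total
print(summary["count"])  # 3.0, the missing entry is not counted
print(summary["mean"])   # 64.0, the average of 60, 53 and 79
```

Comparing `count` with the length of the column is a quick way to see how many entries are missing.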

The .describe() method can be used on columns with the object type, but the output is different.

In [29]:
dataset_name['Player Name'].describe()

count           10
unique          10
top       Player 1
freq             1
Name: Player Name, dtype: object

The information we receive includes:

- count: the number of (non-missing) entries in the column
- unique: the number of unique values
- top: the most frequent value (if several values tie, only one of them is shown)
- freq: the number of times the top value appears
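
Because every player name above is unique, `top` and `freq` are not very informative in that output. The sketch below uses a hypothetical column of position labels (not part of the dataset) in which one value repeats.

```python
import pandas as pd

# Hypothetical text column with a repeated value.
positions = pd.Series(["Guard", "Forward", "Guard", "Center", "Guard"])

summary = positions.describe()
print(summary["count"])   # 5
print(summary["unique"])  # 3
print(summary["top"])     # Guard
print(summary["freq"])    # 3
```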