# Data Analysis with Pandas


The pandas library is especially useful for computing with data and finding patterns in it, which makes it one of the most useful tools for any data scientist or engineer. Most of the time, however, you will not use pandas alone; you will likely combine it with modules such as __scipy__ and __matplotlib__. Here we focus on pandas, but the examples also use __scipy__ and __matplotlib__ to display data distributions and the like.

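Pandas provides the data scientist or analyst with three very useful data structures named "series", "dataframe" and "panel". Of these, Panel was deprecated in pandas 0.20 and removed in 0.25, so current code works with Series and DataFrame.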


## Series

A “series” is nothing but a labelled column of data that can hold any datatype (int, float, string, objects, etc). A pandas “series” can be created using the following constructor call:

```python
pandas.Series(data, index, dtype, copy)
```
The argument "data" holds the data elements (a list, a dictionary, a scalar, or commonly a NumPy ndarray). "index" is a list of hashable labels with the same length as "data"; if it is omitted, pandas uses the integers 0, 1, 2, and so on. "dtype" sets the data type (a Series is a homogeneous collection of elements), and "copy" specifies whether the input data is copied. By default, this is false.
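
The constructor arguments described above can be sketched as follows. The values and labels are made up for illustration, and passing copy=True is a choice made here only to show the effect of the flag.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, used only for illustration
values = np.array([1.5, 2.5, 3.5])
labels = ["a", "b", "c"]

s = pd.Series(data=values, index=labels, dtype="float64", copy=True)

# Because copy=True, changing the source array afterwards does not affect the Series
values[0] = 99.0

print(s)
print(s.dtype)   # float64
print(s["a"])    # 1.5
```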

## DataFrame

Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.

DataFrames come with the pandas library and are defined as two-dimensional labeled data structures with columns of potentially different types.

In general, you could say that the Pandas DataFrame consists of three main components: the data, the index, and the columns.


The DataFrame can contain data that is:
<ul>
    <li>a Pandas DataFrame</li>
    <li>a Pandas Series: a one-dimensional labeled array capable of holding any data type with axis labels or index. An example of a Series object is one column from a DataFrame.</li>
    <li>a NumPy ndarray, which can be a record or structured array</li>
    <li>a two-dimensional ndarray</li>
    <li>dictionaries of one-dimensional ndarrays, lists, dictionaries or Series.</li>
</ul>

Besides data, you can also specify the index and column names for your DataFrame. The index labels the rows, while the column names label the columns. You will see later that these two components of the DataFrame come in handy when you’re manipulating your data.
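
As a minimal sketch of the three components (data, index, columns), the snippet below builds a DataFrame from a two-dimensional ndarray and another from a dictionary of Series. All names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 2x3 array of made-up numbers
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

df_example = pd.DataFrame(data,
                          index=["row1", "row2"],
                          columns=["a", "b", "c"])

print(df_example.values)          # the data
print(list(df_example.index))     # ['row1', 'row2']
print(list(df_example.columns))   # ['a', 'b', 'c']

# A dictionary of Series also works; values are aligned on the index labels
from_series = pd.DataFrame({
    "x": pd.Series([10, 20], index=["row1", "row2"]),
    "y": pd.Series([0.5, 1.5], index=["row1", "row2"]),
})
print(from_series.dtypes)         # each column keeps its own type
```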

### Creating, Reading and Writing

### Getting started

To use pandas, you'll typically start with the following line of code.

In [None]:
import pandas as pd

### Creating Data

There are two core objects in pandas: the DataFrame and the Series.


### Series

A Series is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one with nothing more than a list (here with an optional index of labels):

In [None]:
pd.Series([1, 2, 3, 4, 5], index=['A','B', 'C','D','E'])

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series using an index parameter, as above. However, a Series does not have a column name; it only has one overall name, set with the name parameter. The IPython help query below shows the full constructor signature:

In [None]:
pd.Series?

__Assign names to our values__

Pandas automatically generates an integer index (0, 1, 2, ...) unless we define our own labels. Each index label corresponds to its value in the Series object. Let’s look at an example where we assign a year label to each sales figure for a product.

In [None]:
Data = pd.Series([30, 35.5, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Data

In [None]:
Data.index.name = "Year" # Set index name 
Data

In [None]:
Data.name="New Products"

In [None]:
Data

__Select entries from a Series__

To select entries from a Series, we select elements based on the index name or index number.

In [None]:
Data["2016 Sales"] #Fetch element at index named 2016 Sales

In [None]:
Data.iloc[-1]   # Fetch the last element by position

In [None]:
Data.iloc[[0, 2]]   # Fetch elements at multiple positions

__Drop entries from a Series__

Dropping an unwanted entry is a common operation in pandas. Calling drop(index_name) on a Series returns a new Series without the entry that has that index label.

In [None]:
print(" Original Data", Data)

In [None]:
Data = Data.drop("2016 Sales") # Drop index named 2016 Sales

In [None]:
print("New Data ",Data)


__DataFrame__

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

In [None]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

In this example, the "0, No" entry has the value of 131. The "0, Yes" entry has a value of 50, and so on.

DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:

In [None]:
pd.DataFrame({'Abebe': ['ወድጀዋለሁ.', 'It was awful.'], 'Kebede': ['Pretty good.', 'Bland.']})

We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Abebe and Kebede in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 $(0, 1, 2, 3,\cdots)$ for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

In [None]:
pd.DataFrame({'አበበ': ['ወድጀዋለሁ.', 'It was awful.'], 
              'ከበደ': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

__Reading data files__

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file.

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the __pd.read_csv()__ function to read the data into a DataFrame. This goes thusly:


In [None]:
wine_reviews = pd.read_csv("wd.csv", index_col=0)

In [None]:
wine_reviews.head()

We can use the shape attribute to check how large the resulting DataFrame is:

In [None]:
wine_reviews.shape

In [None]:
wine_reviews['price']

In [None]:
wine_reviews.tail()

So our new DataFrame has 150,000 records split across 14 different columns. That's about 2 million entries!

We can examine the contents of the resultant DataFrame using the __head()__ command, which grabs the first five rows by default; passing a number, as in head(10) below, returns that many rows instead:

In [None]:
wine_reviews.head(10)

The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.

In [None]:
wine_reviews.info()

In [None]:
wine_reviews.describe(include='all')

## Select an index or Column from a Pandas DataFrame

In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.

Hence to access the points property of wine_reviews we can use:


In [None]:
wine_reviews.points # wine_reviews["points"]

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:

In [None]:
wine_reviews[['country','points']][:10]

In [None]:
wine_reviews.iloc[-10:,[0,3,4]]

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a country providence column, wine_reviews.country providence wouldn't work because of the space).
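
The difference between the two access styles can be sketched on a tiny table. The DataFrame and the column name with a space in it are hypothetical, chosen only to show where attribute access stops working.

```python
import pandas as pd

# Hypothetical two-row table; "country province" is a made-up column name with a space
toy = pd.DataFrame({
    "country": ["Italy", "France"],
    "country province": ["Italy - Tuscany", "France - Alsace"],
})

by_attribute = toy.country          # works: the name is a valid Python identifier
by_brackets = toy["country"]        # works for any column name
spaced = toy["country province"]    # only the brackets can reach this column

print(by_attribute.equals(by_brackets))
print(spaced.tolist())
```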

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:


In [None]:
wine_reviews['country'][4]

## Indexing in Pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.

#### Index-based selection

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

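To select rows by position, we pass integer positions to iloc. The first cell below selects the first five rows together with the columns at positions 3 and 4. The second cell uses loc to fetch the row labelled 0, which is also the first row when the index labels start at 0: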


In [None]:
wine_reviews.iloc[0:5,[3,4]]

In [None]:
wine_reviews.loc[0]

Both _loc_ and _iloc_ are row-first, column-second. This is the opposite of the plain indexing operator used above, where wine_reviews['country'][4] names the column first and the row second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

In [None]:
wine_reviews.iloc[:,4]

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the country column from just the first, second, and third row, we would do:

In [None]:
wine_reviews.iloc[:3, 0]

Or, to select just the second and third entries, we would do:

In [None]:
wine_reviews.iloc[1:3, 0]

It's also possible to pass a list or array of positions:

In [None]:
import numpy as np

In [None]:
wine_reviews.iloc[np.arange(0,10,2), 0]

Finally, it's worth knowing that negative numbers can be used in selection. They count backwards from the end of the values, so -1 is the last row. So for example here are the last five elements of the dataset.

In [None]:
wine_reviews['province']

In [None]:
wine_reviews.iloc[-5:]

__Label-based selection__

The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the country of the entry whose index label is 4, we would now do the following:


In [None]:
wine_reviews.loc[4, 'country']

__iloc__ is conceptually simpler than __loc__ because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:

In [None]:
wine_reviews.loc[:5, ['points','variety','winery']]
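
One detail worth checking when switching between the two accessors is how slices end. The sketch below uses a made-up column of scores with the default integer index: iloc follows the Python convention and excludes the end position, while loc includes the end label.

```python
import pandas as pd

# Hypothetical scores with the default index 0..6
toy = pd.DataFrame({"points": [87, 90, 92, 85, 88, 91, 89]})

by_position = toy.iloc[:5]   # positions 0 to 4, end excluded
by_label = toy.loc[:5]       # labels 0 to 5, end included

print(len(by_position), len(by_label))   # 5 6
```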

## Manipulating the index

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

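The set_index() method can be used to do the job. It returns a new DataFrame and leaves the original unchanged unless you assign the result back. Here is what happens when we set the index to the province field: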

In [None]:
wine_reviews.head(1)

In [None]:
wine_reviews.set_index("province")


In [None]:
wine_reviews
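
A small sketch of the same behaviour on a hypothetical three-row table: set_index() hands back a new DataFrame whose rows can be looked up by label, while the original keeps its columns.

```python
import pandas as pd

# Hypothetical data, not taken from the wine dataset
toy = pd.DataFrame({
    "province": ["Tuscany", "Alsace", "Oregon"],
    "points": [90, 88, 92],
})

indexed = toy.set_index("province")      # a new DataFrame; toy is untouched

print(list(toy.columns))                 # the original still has 'province' as a column
print(indexed.loc["Alsace", "points"])   # label-based lookup on the new index
```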

## Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in Italy.

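We can start by checking a condition on each row. The cells below preview the data and then use a comparison on price to keep only wines that cost more than 87; checking whether each wine is Italian works the same way: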

In [None]:
wine_reviews.head()

In [None]:
wine_reviews[wine_reviews.price>87]

A comparison such as wine_reviews.country == 'Italy' produces a Series of True/False booleans based on the country of each record. This result can then be used inside the indexing operator or loc to select the relevant data:

In [None]:
wine_reviews[wine_reviews.country == 'Italy']

This DataFrame has ~20,000 rows. The original had ~150,000. That means that around 13% of wines originate from Italy.

We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (&) to bring the two questions together:


In [None]:
wine_reviews.loc[(wine_reviews.country == 'Italy') & (wine_reviews.points >= 90)]

Suppose we'll buy any wine that's made in Italy or which is rated above average. For this we use a pipe (|):

In [None]:
wine_reviews.loc[(wine_reviews.country == 'Italy') | (wine_reviews.points >= 90)]
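
The two combinations can be traced on a hypothetical four-row table. Naming the Boolean masks first makes it easier to see what & and | do; when the comparisons are written inline, each one needs its own parentheses because & and | bind more tightly than == and >=.

```python
import pandas as pd

# Hypothetical reviews, made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 85, 91, 88],
})

is_italian = toy.country == "Italy"      # Boolean Series, one value per row
is_good = toy.points >= 90

both = toy.loc[is_italian & is_good]     # Italian AND at least 90 points
either = toy.loc[is_italian | is_good]   # Italian OR at least 90 points

print(is_italian.tolist())
print(len(both), len(either))            # 1 3
```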

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

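The first is isin. isin lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:

In [None]:
wine_reviews.loc[wine_reviews.country.isin(['Italy', 'France'])]

As an aside, the next cells load World Bank indicator data for Ethiopia from an Excel download, promote the row at position 2 to column headers, drop the rows above the data, and convert the year column labels to integers:

In [None]:
import scipy.stats as st

In [None]:
Data = pd.read_excel('https://api.worldbank.org/v2/en/country/ETH?downloadformat=excel')

In [None]:
Data.head()

In [None]:
Data.columns = Data.iloc[2,:]

Data.head()

In [None]:
Data = Data.iloc[3:,:]
Data.head()


In [None]:
Data.reset_index(inplace=True)
Data.head()

In [None]:
A = Data.columns[5:]

In [None]:
A = list(A.astype(int))

In [None]:
# An Index does not support item assignment, so rebuild the full list of labels.
# Assumption: the first five columns are the non-year columns, as the [5:] slice above implies.
Data.columns = list(Data.columns[:5]) + A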

In [None]:
wine_reviews.head(10)

The second is __isnull__ (and its companion __notnull__). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a region_2 value in the dataset, here's what we would do:

In [None]:
wine_reviews.loc[wine_reviews.region_2.notnull()]
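
Both built-in selectors can be sketched on a hypothetical table that contains missing prices. The column names mirror the wine dataset, but the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN marks a missing price
toy = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "price": [20.0, np.nan, 35.0, np.nan],
})

from_it_or_fr = toy.loc[toy.country.isin(["Italy", "France"])]
priced = toy.loc[toy.price.notnull()]     # rows that have a price
unpriced = toy.loc[toy.price.isnull()]    # rows that lack one

print(len(from_it_or_fr), len(priced), len(unpriced))   # 3 2 2
```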

It helps to grasp how loc differs from the other indexing attributes, __.iloc[]__ and the older __.ix[]__:

__loc[]__ works on the labels of your index. This means that __loc[2]__ looks for the values of your DataFrame that have an index labeled 2.

__iloc[]__ works on the positions in your index. This means that __iloc[2]__ looks for the values of your DataFrame at position 2, that is, the third row.

__ix[]__ is a more complex case: when the index is integer-based, you pass a label to __.ix[]__, so __ix[2]__ looks for values that have an index labeled 2, just like __.loc[]__. However, if your index is not solely integer-based, __ix__ works with positions, just like __.iloc[]__. Because of this ambiguity, __.ix[]__ was deprecated in pandas 0.20 and removed in pandas 1.0; use __.loc[]__ or __.iloc[]__ instead.
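
The distinction between labels and positions is easiest to see when the two disagree. In this sketch the index labels are deliberately shuffled; the Series itself is hypothetical.

```python
import pandas as pd

# Index labels deliberately differ from positions
s_demo = pd.Series(["a", "b", "c"], index=[2, 0, 1])

print(s_demo.loc[2])    # label 2    -> 'a'
print(s_demo.iloc[2])   # position 2 -> 'c'
```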


## Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

In [None]:
wine_reviews['critic'] = 'everyone'

In [None]:
wine_reviews.price.describe()

Or with an iterable of values:

In [None]:
wine_reviews['critic'] = range(len(wine_reviews), 0, -1)

In [None]:
wine_reviews.head(3)
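
The same two forms of assignment, plus assignment to a subset of rows through loc, can be sketched on a hypothetical table. The column name rank_desc and the label "expert" are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({"points": [92, 85, 91]})     # hypothetical scores

toy["critic"] = "everyone"                       # a constant is broadcast to every row
toy["rank_desc"] = range(len(toy), 0, -1)        # an iterable supplies one value per row
toy.loc[toy.points >= 90, "critic"] = "expert"   # loc assigns to the selected rows only

print(toy)
```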

## Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

In [None]:
wine_reviews.points.describe()

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [None]:
wine_reviews.winery.describe()

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.

For example, mean() gives the average of the points allotted (e.g. how well an averagely rated wine does), and quantile() gives a chosen percentile. The cell below computes the 95th percentile of points:


In [None]:
wine_reviews.points.quantile(0.95)

To see a list of unique values we can use the unique() function; wrapping it in len() counts them:

In [None]:
len(wine_reviews.winery.unique())

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

In [None]:
wine_reviews.winery.value_counts()
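
The summary functions mentioned in this section can be tried on two short hypothetical Series. nunique() is used here as a shortcut for len(unique()) when there are no missing values.

```python
import pandas as pd

# Hypothetical data, made up for illustration
winery = pd.Series(["A", "B", "A", "C", "A"])
points = pd.Series([88, 90, 92, 86, 94])

print(points.mean())             # 90.0
print(points.quantile(0.5))      # the median
print(winery.unique())           # the distinct values
print(winery.nunique())          # 3
print(winery.value_counts())     # how often each value occurs
```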


#### Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:


In [None]:
review_points_mean = wine_reviews.points.mean()

In [None]:
review_points_mean

In [None]:
wine_reviews.points - review_points_mean

In [None]:
wine_reviews.points.map(lambda p: p - review_points_mean)

The function you pass to __map()__ should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

__apply()__ is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.


In [None]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

In [None]:
wine_reviews.apply(remean_points, axis='columns')

If we had called wine_reviews.apply() with __axis='index'__, then instead of passing a function to transform each row, we would need to give a function to transform each column.

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of wine_reviews, we can see that it still has its original points value.


In [None]:
wine_reviews.head(1)

Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:

In [None]:
review_points_mean = wine_reviews.points.mean()
wine_reviews.points - review_points_mean

In [None]:
wine_reviews.country + " - " + wine_reviews.region_1
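
As a quick check on a hypothetical Series of three scores, the sketch below confirms that map() and the built-in subtraction give the same result and that neither changes the original values.

```python
import pandas as pd

points = pd.Series([88, 90, 92])     # hypothetical scores
mean_points = points.mean()          # 90.0

mapped = points.map(lambda p: p - mean_points)   # calls the function once per value
vectorized = points - mean_points                # one built-in, whole-column operation

print(mapped.tolist())               # [-2.0, 0.0, 2.0]
print(mapped.equals(vectorized))     # True
print(points.tolist())               # the original is unchanged
```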

## Groupwise analysis

One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:

In [None]:
wine_reviews.groupby('points').points.count()
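
The equivalence with value_counts(), and one step beyond it, can be sketched on a hypothetical table. agg() with a list of function names is one way to compute several statistics per group at once.

```python
import pandas as pd

# Hypothetical reviews, made up for illustration
toy = pd.DataFrame({
    "points": [88, 90, 88, 92, 90, 88],
    "price": [10.0, 20.0, 14.0, 40.0, 30.0, 12.0],
})

counts_groupby = toy.groupby("points").points.count()
counts_vc = toy.points.value_counts().sort_index()   # same counts, sorted by points

# Several statistics of price for each points value
summary = toy.groupby("points").price.agg(["min", "max", "mean"])

print(counts_groupby.tolist(), counts_vc.tolist())
print(summary)
```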

## Merging with Pandas

Merging is used when we want to combine data that shares a key variable but is located in different DataFrames. To merge DataFrames, we use the merge() function. Say we have df1 and df2.

In [None]:
d = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'student_name': ['Mark', 'Khalid', 'Deborah', 'Trevon', 'Raven']
}
df1 = pd.DataFrame(d, columns=['subject_id', 'student_name'])
df1

In [None]:
data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'student_name': ['Eric', 'Imani', 'Cece', 'Darius', 'Andre']
}
df2 = pd.DataFrame(data, columns=['subject_id', 'student_name'])
df2

So, how do we merge them? It’s simple: with the merge() function!

In [None]:
# subject_id is the key the two tables share; their student_name values never match
pd.merge(df1, df2, on='subject_id')
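
Which rows survive a merge depends on the how parameter. The sketch below uses two small hypothetical tables, named left and right so as not to overwrite df1 and df2, and compares the default inner join with an outer join.

```python
import pandas as pd

# Hypothetical tables that share the ids '4' and '5'
left = pd.DataFrame({"subject_id": ["1", "2", "4", "5"],
                     "student_name": ["Mark", "Khalid", "Trevon", "Raven"]})
right = pd.DataFrame({"subject_id": ["4", "5", "6"],
                      "student_name": ["Eric", "Imani", "Cece"]})

inner = pd.merge(left, right, on="subject_id")                # ids present in both tables
outer = pd.merge(left, right, on="subject_id", how="outer")   # ids present in either table

# Columns with the same name on both sides get the suffixes _x and _y
print(list(inner.columns))
print(len(inner), len(outer))   # 2 5
```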

## Grouping with Pandas

Grouping is how we categorize our data. If a value occurs in multiple rows of a single column, the data related to that value in other columns can be grouped together. Just like with merging, it’s simpler than it sounds. We use the groupby function. Look at this example.

In [None]:
raw = {
    'Name': ['Darell', 'Darell', 'Lilith', 'Lilith', 'Tran', 'Tran', 'Tran',
        'Tran', 'John', 'Darell', 'Darell', 'Darell'],
    'Position': [2, 1, 1, 4, 2, 4, 3, 1, 3, 2, 4, 3],
    'Year': [2009, 2010, 2009, 2010, 2010, 2010, 2011, 2012, 2011, 2013, 2013, 2012],
    'Marks':[408, 398, 422, 376, 401, 380, 396, 388, 356, 402, 368, 378]
}
df = pd.DataFrame(raw)
df

In [None]:
group = df.groupby('Year')
group.get_group(2010)
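
Beyond pulling out a single group, a grouped object can be aggregated. This sketch uses a shortened, hypothetical version of the table above (named scores here) to compute the mean mark and the number of rows per year.

```python
import pandas as pd

# A shortened, hypothetical version of the marks table
scores = pd.DataFrame({
    "Name": ["Darell", "Lilith", "Tran", "Darell"],
    "Year": [2009, 2009, 2010, 2010],
    "Marks": [408, 422, 401, 398],
})

by_year = scores.groupby("Year")

print(by_year.get_group(2010))          # only the rows from 2010
mean_marks = by_year["Marks"].mean()    # one value per year
sizes = by_year.size()                  # number of rows per year

print(mean_marks)
print(sizes)
```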

## Concatenation

Concatenation means appending one set of data to another. We use the concat() function to do so. To clarify the difference between merging and concatenation: merge() combines data on shared key columns, while concat() stacks DataFrames along rows (the default) or along columns.

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1,df2])
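
Two options of concat() are worth a quick look: ignore_index renumbers the rows, and axis=1 places the tables side by side. The tables a and b below are hypothetical.

```python
import pandas as pd

# Hypothetical tables with the same columns
a = pd.DataFrame({"id": ["1", "2"], "name": ["Mark", "Khalid"]})
b = pd.DataFrame({"id": ["3", "4"], "name": ["Eric", "Imani"]})

stacked = pd.concat([a, b])                          # keeps the original row labels
renumbered = pd.concat([a, b], ignore_index=True)    # fresh labels 0..3
side_by_side = pd.concat([a, b], axis=1)             # aligns on the index, adds columns

print(list(stacked.index))       # [0, 1, 0, 1]
print(list(renumbered.index))    # [0, 1, 2, 3]
print(side_by_side.shape)        # (2, 4)
```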

## Downloading data sets from a URL

In [None]:
download_url = ("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
df = pd.read_csv(download_url)

In [None]:
df.head()

## Create Your First Pandas Plot

Your dataset contains some columns related to the earnings of graduates in each major:
<ul>
    <li> "Median" is the median earnings of full-time, year-round workers.</li>
    <li> "P25th" is the 25th percentile of earnings.</li>
    <li> "P75th" is the 75th percentile of earnings.</li>
    <li> "Rank" is the major's rank by median earnings.</li>
</ul>
Let's start with a plot displaying these columns. First, you need to set up your Jupyter Notebook to display plots with the __%matplotlib__ magic command:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
df.plot(x="Rank", y=["P25th", "Median", "P75th"])
plt.show()

Looking at the plot, you can make the following observations:
<ul>
    <li>The median income decreases as the rank number increases. This is expected because the rank is determined by the median income.</li>
    <li>Some majors have large gaps between the 25th and 75th percentiles. People with these degrees may earn significantly less or significantly more than the median income.</li>
    <li>Other majors have very small gaps between the 25th and 75th percentiles. People with these degrees earn salaries very close to the median income.</li>
</ul>
Your first plot already hints that there's a lot more to discover in the data! Some majors have a wide range of earnings, and others have a rather narrow range. To discover these differences, you'll use several other types of plots.



.plot() has several optional parameters. Most notably, the kind parameter accepts eleven different string values and determines which kind of plot you’ll create:
<ol>
    <li>"area" is for area plots.</li>
    <li>"bar" is for vertical bar charts.</li>
    <li>"barh" is for horizontal bar charts.</li>
    <li>"box" is for box plots.</li>
    <li>"hexbin" is for hexbin plots.</li>
    <li>"hist" is for histograms.</li>
    <li>"kde" is for kernel density estimate charts.</li>
    <li>"density" is an alias for "kde".</li>
    <li>"line" is for line graphs.</li>
    <li>"pie" is for pie charts.</li>
    <li>"scatter" is for scatter plots.</li>
</ol>
The default value is __"line"__. Line graphs, like the one you created above, provide a good overview of your data. You can use them to detect general trends. They rarely provide sophisticated insight, but they can give you clues as to where to zoom in.

If you don’t provide a parameter to .plot(), then it creates a line plot with the index on the x-axis and all the numeric columns on the y-axis. While this is a useful default for datasets with only a few columns, for the college majors dataset and its several numeric columns, it looks like quite a mess.

## Distributions and Histograms

DataFrame is not the only class in pandas with a .plot() method. As so often happens in pandas, the Series object provides similar functionality.

You can get each column of a DataFrame as a Series object. Here’s an example using the "Median" column of the DataFrame you created from the college major data:

In [None]:
median_column = df["Median"]
type(median_column)

In [None]:
median_column.plot(kind="hist")
plt.show()

The histogram shows the data grouped into ten bins ranging from $\$20,000$ to $\$120,000$, and each bin has a width of $\$10,000$. The histogram has a different shape than the normal distribution, which has a symmetric bell shape with a peak in the middle.

The histogram of the median data, however, peaks on the left below $\$40,000$. The tail stretches far to the right and suggests that there are indeed fields whose majors can expect significantly higher earnings.

## Outliers

Have you spotted that lonely small bin on the right edge of the distribution? It seems that one data point has its own category. The majors in this field get an excellent salary compared not only to the average but also to the runner-up. Although this isn't its main purpose, a histogram can help you to detect such an outlier. Let's investigate the outlier a bit more:
<ul>
    <li>Which majors does this outlier represent?</li>
    <li>How big is its edge?</li>
</ul>
Contrary to the first overview, you only want to compare a few data points, but you want to see more details about them. For this, a bar plot is an excellent tool. First, select the five majors with the highest median earnings. You'll need two steps:
<ul>
    <li>To sort by the "Median" column, use .sort_values() and provide the name of the column you want to sort by as well as the direction ascending=False.</li>
    <li>To get the top five items of your list, use .head().</li>
</ul>
Let's create a new DataFrame called top_5:

In [None]:
top_5 = df.sort_values(by="Median", ascending=False).head()

Now you have a smaller DataFrame containing only the top five most lucrative majors. As a next step, you can create a bar plot that shows only the majors with these top five median salaries:

In [None]:
top_5.plot(x="Major", y="Median", kind="bar", rot=45, fontsize=8)
plt.show()

Notice that you use the rot and fontsize parameters to rotate and size the labels of the x-axis so that they're visible. You'll see a plot with five bars.

This plot shows that the median salary of petroleum engineering majors is more than $\$20,000$ higher than the rest. The earnings for the second- through fourth-place majors are relatively close to one another.

If you have a data point with a much higher or lower value than the rest, then you’ll probably want to investigate a bit further. For example, you can look at the columns that contain related data.

Let’s investigate all majors whose median salary is above $\$60,000$. First, you need to filter these majors with the mask _df[df["Median"] > 60000]_. Then you can create another bar plot showing all three earnings columns:

In [None]:
top_medians = df[df["Median"] > 60000].sort_values("Median")
top_medians.plot(x="Major", y=["P25th", "Median", "P75th"], kind="bar",)
plt.show()

The 25th and 75th percentile confirm what you’ve seen above: petroleum engineering majors were by far the best paid recent graduates.

Why should you be so interested in outliers in this dataset? If you’re a college student pondering which major to pick, you have at least one pretty obvious reason. But outliers are also very interesting from an analysis point of view. They can indicate not only industries with an abundance of money but also invalid data.

Invalid data can be caused by any number of errors or oversights, including a sensor outage, an error during the manual data entry, or a five-year-old participating in a focus group meant for kids age ten and above. Investigating outliers is an important step in data cleaning.

Even if the data is correct, you may decide that it’s just so different from the rest that it produces more noise than benefit. Let’s assume you analyze the sales data of a small publisher. You group the revenues by region and compare them to the same month of the previous year. Then out of the blue, the publisher lands a national bestseller.

This pleasant event makes your report kind of pointless. With the bestseller’s data included, sales are going up everywhere. Performing the same analysis without the outlier would provide more valuable information, allowing you to see that in New York your sales numbers have improved significantly, but in Miami they got worse.

## Check for Correlation

Often you want to see whether two columns of a dataset are connected. If you pick a major with higher median earnings, do you also have a lower chance of unemployment? As a first step, create a scatter plot with those two columns:

In [None]:
df.plot(x="Median", y="Unemployment_rate", kind="scatter")
plt.show()

You should see a fairly random-looking scatter plot of median earnings against unemployment rate.

A quick glance at this figure shows that there’s no significant correlation between the earnings and unemployment rate.

While a scatter plot is an excellent tool for getting a first impression about possible correlation, it certainly isn’t definitive proof of a connection. For an overview of the correlations between different columns, you can use .corr(). If you suspect a correlation between two values, then you have several tools at your disposal to verify your hunch and measure how strong the correlation is.

Keep in mind, though, that even if a correlation exists between two values, it still doesn’t mean that a change in one would result in a change in the other. In other words, correlation does not imply causation.
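
A minimal sketch of .corr() on a hypothetical table is shown below. The column names follow the college majors dataset, but the four rows are made up, so the coefficients say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Median": [30000, 40000, 50000, 60000],
    "P75th": [40000, 52000, 61000, 75000],
    "Unemployment_rate": [0.05, 0.09, 0.04, 0.08],
})

corr_matrix = toy.corr()                 # pairwise Pearson correlation of all columns
r = toy["Median"].corr(toy["P75th"])     # correlation of a single pair

print(corr_matrix)
print(r)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.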

## Analyze Categorical Data

To process bigger chunks of information, the human mind consciously and unconsciously sorts data into categories. This technique is often useful, but it’s far from flawless.

Sometimes we put things into a category that, upon further examination, aren’t all that similar. In this section, you’ll get to know some tools for examining categories and verifying whether a given categorization makes sense.

Many datasets already contain some explicit or implicit categorization. In the current example, the 173 majors are divided into 16 categories.

## Grouping

A basic usage of categories is grouping and aggregation. You can use .groupby() to determine how popular each of the categories in the college major dataset is:

In [None]:
cat_totals = df.groupby("Major_category")["Total"].sum().sort_values()

In [None]:
cat_totals

In [None]:
cat_totals.plot(kind="barh", fontsize=4)
plt.show()

As your plot shows, business is by far the most popular major category. While humanities and liberal arts is the clear second, the rest of the fields are more similar in popularity.

## Determining Ratios

Vertical and horizontal bar charts are often a good choice if you want to see the difference between your categories. If you’re interested in ratios, then pie plots are an excellent tool. However, since cat_totals contains a few smaller categories, creating a pie plot with cat_totals.plot(kind="pie") will produce several tiny slices with overlapping labels.

To address this problem, you can lump the smaller categories into a single group. Merge all categories with a total under 100,000 into a category called "Other", then create a pie plot:

In [None]:
small_cat_totals = cat_totals[cat_totals < 100_000]

In [None]:
big_cat_totals = cat_totals[cat_totals >= 100_000]  # >= so a total of exactly 100,000 is not dropped

In [None]:
# Adding a new item "Other" with the sum of the small categories
small_sums = pd.Series([small_cat_totals.sum()], index=["Other"])
big_cat_totals = pd.concat([big_cat_totals, small_sums])  # Series.append() was removed in pandas 2.0
big_cat_totals.plot(kind="pie", label="")
plt.show()

Notice that you include the argument label="". By default, pandas adds a label with the column name. That often makes sense, but in this case it would only add noise.
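
The lumping step can be checked without plotting. The category totals below are hypothetical, and the sketch only verifies that the "Other" entry holds the sum of the small categories and that nothing is lost.

```python
import pandas as pd

# Hypothetical category totals, made up for illustration
totals = pd.Series({"Business": 1_300_000, "Arts": 350_000,
                    "Law": 90_000, "Agriculture": 75_000})
threshold = 100_000

small = totals[totals < threshold]
big = totals[totals >= threshold]

# Append a single "Other" entry that holds the sum of the small categories
lumped = pd.concat([big, pd.Series([small.sum()], index=["Other"])])

print(lumped)
```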

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = np.random.normal(0, 1, 3)
# e.g. array([-1.18878589,  0.59627021,  1.59895721])
plt.figure(figsize=(16, 6))
sns.boxplot(x=data);

In [None]:
from bokeh.plotting import figure, output_file, show

# prepare some data
x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y0 = [i**2 for i in x]
y1 = [10**i for i in x]
y2 = [10**(i**2) for i in x]

# output to static HTML file
output_file("log_lines.html")

# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   y_axis_type="log", y_range=[0.001, 10**11], title="log axis example",
   x_axis_label='sections', y_axis_label='particles'
)

# add some renderers (legend_label replaces the legend keyword deprecated in Bokeh 1.4)
p.line(x, x, legend_label="y=x")
p.circle(x, x, legend_label="y=x", fill_color="white", size=8)
p.line(x, y0, legend_label="y=x^2", line_width=3)
p.line(x, y1, legend_label="y=10^x", line_color="red")
p.circle(x, y1, legend_label="y=10^x", fill_color="red", line_color="red", size=6)
p.line(x, y2, legend_label="y=10^x^2", line_color="orange", line_dash="4 4")

# show the results
show(p)

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()