First we need to import pandas; run the cell below if you don't already have it installed!

In [None]:
!pip install pandas

In [None]:
import pandas as pd

Let's get some data to work with. We'll read a CSV of Titanic data, but check out the other pd.read_* functions (read_excel, read_sql, read_json, and so on), as they handle most common formats.

dataframe.head() shows the first five rows by default; pass a number to see more or fewer.

In [None]:
titanic_data=pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic_data.head()
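To try read_csv without a network connection, here is a minimal sketch that reads CSV text from memory. The column names and rows are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL.
csv_text = """PassengerId,Name,Age
1,"Doe, Mr. John",22
2,"Roe, Miss. Jane",
3,"Poe, Mrs. Ann",35
"""

# read_csv accepts any file-like object, not just paths and URLs.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows only
print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # Age is read as float because one value is missing
```

Quoted fields keep their commas, which matters for names written as "Last, Title. First".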

By default, pandas assigns each row an integer index (0, 1, 2, ...); let's use Name instead, since it should be unique.

In [None]:
titanic_data.set_index("Name",inplace=True)
titanic_data.head()
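Since the index choice relies on names being unique, it can be worth checking. This is a small sketch on made-up data (the names and ages are hypothetical) showing two ways to confirm uniqueness.

```python
import pandas as pd

# Hypothetical rows for illustration only.
people = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mrs. Ann"],
    "Age": [22, 38, 26],
})

# Check uniqueness before promoting the column to the index.
names_are_unique = people["Name"].is_unique

# Without inplace=True, set_index returns a new dataframe and leaves the original alone.
indexed = people.set_index("Name")

# verify_integrity=True makes set_index raise ValueError on duplicate labels.
dupes = pd.DataFrame({"Name": ["A", "A"], "Age": [1, 2]})
try:
    dupes.set_index("Name", verify_integrity=True)
    duplicate_error = False
except ValueError:
    duplicate_error = True

print(names_are_unique, duplicate_error)
```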

The recommended way to select subsets of the dataframe is .loc[row labels, column labels]. Selecting a single row or column returns a Series, while a list of labels returns a dataframe. Use : to select everything along an axis.

In [None]:
titanic_data.loc["Allen, Mr. William Henry",:]

In [None]:
titanic_data.loc[["Allen, Mr. William Henry","Heikkinen, Miss. Laina"],:]

In [None]:
titanic_data.loc[["Allen, Mr. William Henry","Heikkinen, Miss. Laina"],"Sex"]

In [None]:
titanic_data.loc[:,"Survived"]
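The return type of .loc depends on what you pass in. The sketch below uses a tiny made-up dataframe (labels "a", "b", "c" are placeholders) to show each case, plus a boolean mask, which .loc also accepts.

```python
import pandas as pd

# Tiny illustrative dataframe; the labels and values are made up.
df = pd.DataFrame(
    {"Sex": ["male", "female", "male"], "Age": [22.0, 38.0, 35.0]},
    index=["a", "b", "c"],
)

one_row = df.loc["a", :]                   # single label -> Series
two_rows = df.loc[["a", "b"], :]           # list of labels -> DataFrame
one_col = df.loc[:, "Age"]                 # single column -> Series
one_cell = df.loc["b", "Age"]              # label + column -> scalar
over_30 = df.loc[df["Age"] > 30, ["Sex"]]  # boolean mask + list -> DataFrame

print(type(one_row).__name__, type(two_rows).__name__)
```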

We can create new columns by doing operations on existing columns

In [None]:
titanic_data["age_survived_multiplied"]=titanic_data["Age"]*titanic_data["Survived"]

To save data out, use a dataframe.to_{format} method; the most useful/common are to_csv, to_excel, to_sql, and to_clipboard. These all have a ton of options, and the documentation is really good, so make sure to use it!

In [None]:
titanic_data.to_csv("data.csv")
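One detail worth knowing: to_csv writes the index as the first column, so reading the file back needs index_col to restore it. A minimal round-trip sketch, using an in-memory buffer instead of a file and made-up rows:

```python
import io
import pandas as pd

# Hypothetical two-row dataframe indexed by Name.
df = pd.DataFrame(
    {"Age": [22.0, 38.0]},
    index=pd.Index(["Doe, Mr. John", "Roe, Miss. Jane"], name="Name"),
)

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first column
buffer.seek(0)

# index_col tells read_csv which column to turn back into the index.
restored = pd.read_csv(buffer, index_col="Name")
print(restored)
```

Pass index=False to to_csv if the index carries no information you want to keep.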

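dataframe.describe() and dataframe.info() give useful summaries of your data: describe() reports summary statistics for the numeric columns, and info() lists each column's dtype and non-null count.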

In [None]:
titanic_data.describe()

In [None]:
titanic_data.info()

We're going to join this dataset with another dataset; I'll make up some random "factors" that vary by gender

In [None]:
d = {'gender': ["male", "female"], 'factor': [3, 2]}
gender_factors = pd.DataFrame(data=d)
gender_factors

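Now let's merge them! The pandas documentation on merging is really good and covers many options for bringing datasets together. Note that merging on columns discards the existing index (Name here), so call reset_index() first if you need to keep it.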

In [None]:
pd.merge(titanic_data,gender_factors,how="left", left_on="Sex",right_on="gender")
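Two merge options that help catch surprises are validate and indicator. The sketch below uses made-up data, including an "unknown" value that has no match, to show what a left join does with unmatched rows.

```python
import pandas as pd

# Illustrative data; "unknown" deliberately has no match in factors.
passengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "unknown"],
})
factors = pd.DataFrame({"gender": ["male", "female"], "factor": [3, 2]})

merged = pd.merge(
    passengers,
    factors,
    how="left",
    left_on="Sex",
    right_on="gender",
    validate="many_to_one",  # raise if a gender appears more than once in factors
    indicator=True,          # add a _merge column showing where each row came from
)
print(merged)
```

Rows without a match keep their left-hand values and get NaN in the right-hand columns, which also turns the integer factor column into floats.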

For non-exact joins, check out merge_asof, which can match on the nearest key, the nearest key without going over, and so on.
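As a rough illustration of merge_asof, the sketch below looks up a rate by age band. The rates table and its column names are hypothetical. Both frames need to be sorted by their key columns, and the key dtypes should match.

```python
import pandas as pd

# Hypothetical lookup table: each rate applies from min_age upward.
ages = pd.DataFrame({"Age": [15.0, 22.0, 38.0]})
rates = pd.DataFrame({"min_age": [0.0, 18.0, 40.0], "rate": [0.5, 1.0, 1.5]})

# "backward" picks the last min_age that is <= Age (nearest without going over).
backward = pd.merge_asof(
    ages, rates, left_on="Age", right_on="min_age", direction="backward"
)

# "nearest" picks the closest min_age in either direction.
nearest = pd.merge_asof(
    ages, rates, left_on="Age", right_on="min_age", direction="nearest"
)

print(backward)
print(nearest)
```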

groupby is another incredibly powerful way to analyze or summarize data

In [None]:
titanic_data.groupby("Sex").mean(numeric_only=True)

To look at only one column, select it after the groupby

In [None]:
titanic_data.groupby("Sex")["Survived"].mean()

In [None]:
titanic_data.groupby(["Sex","Pclass"])["Survived"].mean()
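When you want several summaries at once, agg with named aggregations is one option. This is a sketch on made-up rows; the output column names (passengers, survival_rate, mean_age) are my own choices.

```python
import pandas as pd

# Illustrative rows, including one missing age.
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1],
    "Age": [20.0, 40.0, 30.0, None],
})

# Each keyword is output_name=(input_column, aggregation).
summary = df.groupby("Sex").agg(
    passengers=("Survived", "count"),
    survival_rate=("Survived", "mean"),
    mean_age=("Age", "mean"),  # missing ages are skipped
)
print(summary)
```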

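Binning is a very common actuarial task, which pd.cut handles nicely. With the default settings each bin excludes its left edge and includes its right edge, and values outside the bins become NaN.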

In [None]:
bins=[10,20,30,40,50]
titanic_data['age_bin'] = pd.cut(titanic_data['Age'], bins)
titanic_data
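The edge behavior of pd.cut is easy to get wrong, so here is a small sketch with made-up ages that sit exactly on the bin edges. The labels ("10s", "20s", ...) are placeholders I chose.

```python
import pandas as pd

ages = pd.Series([10, 15, 20, 35, 50, 65])
bins = [10, 20, 30, 40, 50]

# Default: intervals are (10, 20], (20, 30], ... so 10 and 65 fall outside.
default_bins = pd.cut(ages, bins)

# right=False flips the closed side: [10, 20), [20, 30), ... so 50 and 65 fall outside.
left_closed = pd.cut(ages, bins, right=False, labels=["10s", "20s", "30s", "40s"])

print(pd.DataFrame({"age": ages, "default": default_bins, "left_closed": left_closed}))
```

If the lowest edge should be included with the default right-closed bins, include_lowest=True does that.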

apply can run any function over the rows (axis=1) or columns (axis=0) of a dataframe!

In [None]:
def gender_based_calc(input_row):
    if input_row["Sex"]=="male":
        answer=0
    else:
        answer=1
    return answer

#np.where is less verbose!

titanic_data["gender_encoded"]=titanic_data.apply(gender_based_calc,axis=1)
titanic_data
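The np.where alternative mentioned in the comment above can be sketched as follows, alongside Series.map as a third option. The data is made up, and the column names (via_apply, via_where, via_map) are only for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Row-wise apply: flexible, but calls a Python function once per row.
df["via_apply"] = df.apply(lambda row: 0 if row["Sex"] == "male" else 1, axis=1)

# np.where(condition, value_if_true, value_if_false) works on the whole column at once.
df["via_where"] = np.where(df["Sex"] == "male", 0, 1)

# map with a dict is handy when there are more than two categories.
df["via_map"] = df["Sex"].map({"male": 0, "female": 1})

print(df)
```

One difference to keep in mind: map returns NaN for values missing from the dict, while the other two send every non-"male" value to 1.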

scikit-learn (imported as sklearn) is a very popular machine learning library. Note that we skip many common practices here, such as train/test splits and cross-validation, because the intent is only to illustrate ideas.

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt
classifier = tree.DecisionTreeClassifier(max_depth=3)

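# Age has missing values; scikit-learn versions before 1.3 raise an error on NaN,
# so fill or drop them first if fit() fails
X=titanic_data.loc[:,["Pclass","gender_encoded","Age"]]
y=titanic_data["Survived"]
classifier.fit(X, y)
plt.figure(figsize=(12,12))
# pass a plain list; some scikit-learn versions reject a pandas Index here
tree.plot_tree(classifier,feature_names=list(X.columns))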

In [None]:
from sklearn.metrics import accuracy_score

# scored on the same rows the tree was fit on, so this is an optimistic estimate
accuracy = accuracy_score(y, classifier.predict(X))
accuracy
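For reference, the train/test split and cross-validation skipped above might look roughly like this. The sketch uses synthetic stand-in data rather than the Titanic dataset, so the scores say nothing about the real problem; the feature names simply mirror the ones used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the label follows gender_encoded, flipped about 10% of the time.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "gender_encoded": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n),
})
flip = rng.random(n) < 0.1
y = pd.Series(np.where(flip, 1 - X["gender_encoded"], X["gender_encoded"]), name="Survived")

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# 5-fold cross-validation refits a fresh tree on each fold.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5
)

print(train_accuracy, test_accuracy, cv_scores.mean())
```

Comparing the training score against the held-out score is a quick check for overfitting.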