# Introduction to Python for Data Analysis


pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.


The Pandas library is one of the most popular tools among Python data scientists and analysts, and it forms the backbone of many data projects. Pandas is an open-source Python package for data cleaning and data manipulation. It provides flexible data structures for holding different types of labeled and relational data. It is also easy to install and use.

Pandas is often used in conjunction with other Python libraries. In fact, Pandas is built on the NumPy package, so the two share much of their structure. Pandas is also commonly used alongside SciPy for statistical analysis and Matplotlib for plotting. You can use Pandas in plain scripts written in a text editor or in Jupyter Notebooks, a popular environment for more complex data modeling. Current releases of Pandas require Python 3.
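To make the NumPy relationship concrete, here is a minimal sketch. The `ages` Series is a made-up example, not part of any dataset used in this notebook. It shows that a Series can hand back its values as a NumPy array and that NumPy functions accept a Series directly.

```python
import numpy as np
import pandas as pd

# A small, invented Series used only for illustration.
ages = pd.Series([24, 53, 23], name="age")

# The underlying values can be retrieved as a NumPy array.
values = ages.to_numpy()

# NumPy universal functions work on a Series and return a Series,
# keeping the index and name intact.
log_ages = np.log(ages)
```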

Think of Pandas as the home for your data, where you can clean, analyze, and transform it all in one place. For many tabular tasks, Pandas works as a more powerful, scriptable alternative to Excel. Using Pandas, you can do things like:

- Calculate statistics about your data, such as the mean, median, and distribution of columns
- Use data visualization tools, such as Matplotlib, to create bar plots, histograms, and more
- Clean your data by filtering columns by particular criteria or removing unwanted values
- Manipulate your data flexibly using operations like merging, joining, and reshaping
- Read, write, and store your clean data as a database table, text file, or CSV file
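
As a rough illustration of several items in the list above, the following sketch works on two tiny DataFrames. All table names, column names, and values here are invented for the example and do not come from the datasets used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical sales and store tables, invented purely for illustration.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "units": [10, 15, 7, 12, 30],
    "price": [2.5, 2.5, 4.0, 4.0, 1.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "city": ["Austin", "Boston", "Chicago"],
})

# Statistics: mean and median of a column.
mean_units = sales["units"].mean()
median_units = sales["units"].median()

# Cleaning: keep only the rows that meet a condition.
big_orders = sales[sales["units"] >= 10]

# Manipulation: merge the two tables on their shared key.
merged = sales.merge(stores, on="store_id", how="left")

# Writing: serialize to CSV (into an in-memory buffer rather than a file).
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file path to `to_csv` instead of the buffer would write the same text to disk.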

In [1]:
# How to import Pandas in a Jupyter Notebook
import pandas as pd

In [5]:
# How to read tabular data into Pandas (read_table expects tab-separated values by default)
orders = pd.read_table('http://bit.ly/chiporders')

In [6]:
orders.head()

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
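
The call above works without a `sep` argument because `pd.read_table` assumes tab-separated input. The sketch below checks this offline with a short, hand-written tab-separated string (the sample text is an assumption standing in for the real file) and compares the result with `pd.read_csv` given an explicit tab separator.

```python
import io

import pandas as pd

# Hand-written tab-separated sample standing in for a real file.
raw = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# read_table defaults to a tab delimiter.
via_table = pd.read_table(io.StringIO(raw))

# read_csv defaults to a comma, so the tab must be given explicitly.
via_csv = pd.read_csv(io.StringIO(raw), sep="\t")
```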


In [16]:
# This file has no header row, so we pass header=None and supply the column names through names=
user_cols = ['user_id', 'age', 'gender', 'role', 'zipcode']
simple = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)

In [20]:
simple.head()

| | user_id | age | gender | role | zipcode |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |
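
To see why `header=None` and `names` matter, here is a small offline sketch. It uses three pipe-separated rows typed in by hand to mimic the layout of the file above, and compares reading them with and without those parameters.

```python
import io

import pandas as pd

# Three pipe-separated rows with no header line, typed in by hand.
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"
user_cols = ["user_id", "age", "gender", "role", "zipcode"]

# Without header=None, the first data row is consumed as column names.
wrong = pd.read_table(io.StringIO(raw), sep="|")

# With header=None and names=, every row is kept as data.
right = pd.read_table(io.StringIO(raw), sep="|", header=None, names=user_cols)
```

In the first call one record is silently lost to the header, which is easy to miss on a large file.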


# How do I select a Pandas Series from a DataFrame?

In [23]:
uforeports = pd.read_csv('http://bit.ly/uforeports')
uforeports.head()

| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | | LIGHT | NY | 4/18/1933 19:00 |


In [24]:
type(uforeports)

pandas.core.frame.DataFrame

In [40]:
type(uforeports['City'])

pandas.core.series.Series

In [41]:
uforeports['City']

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
                 ...         
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, Length: 18241, dtype: object
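
Bracket notation is not the only way to select a Series. The sketch below uses a two-row DataFrame typed in from the rows shown earlier (the variable name `ufo` is just a stand-in) to compare bracket and dot notation, and shows that passing a list of names returns a DataFrame instead.

```python
import pandas as pd

# Two rows copied by hand from the output above.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro"],
    "Shape Reported": ["TRIANGLE", "OTHER"],
    "State": ["NY", "NJ"],
})

# Bracket and dot notation return the same Series.
city_by_bracket = ufo["City"]
city_by_dot = ufo.City

# Dot notation cannot express a name containing a space, so use brackets.
shape = ufo["Shape Reported"]

# A list of column names selects a DataFrame, not a Series.
subset = ufo[["City", "State"]]
```

Dot notation also fails when a column name collides with an existing DataFrame attribute or method such as `shape` or `head`, so bracket notation is the safer default.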

In [42]:
'ab' + 'cd'

'abcd'

In [47]:
# Combine two columns to derive a new feature; bracket notation is required when creating a new column
uforeports['Location'] = uforeports.City + ', ' + uforeports.State


In [48]:
uforeports.head()

| | City | Colors Reported | Shape Reported | State | Time | Location |
|---|---|---|---|---|---|---|
| 0 | Ithaca | | TRIANGLE | NY | 6/1/1930 22:00 | Ithaca, NY |
| 1 | Willingboro | | OTHER | NJ | 6/30/1930 20:00 | Willingboro, NJ |
| 2 | Holyoke | | OVAL | CO | 2/15/1931 14:00 | Holyoke, CO |
| 3 | Abilene | | DISK | KS | 6/1/1931 13:00 | Abilene, KS |
| 4 | New York Worlds Fair | | LIGHT | NY | 4/18/1933 19:00 | New York Worlds Fair, NY |
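
The `+` operator concatenates string columns element by element, just as it joins two plain Python strings. The following minimal sketch repeats the idea on a two-row DataFrame typed in by hand, and also shows `Series.str.cat` as one possible alternative. The column name `Location2` is hypothetical and exists only for the comparison.

```python
import pandas as pd

# Two rows copied by hand from the output above.
ufo = pd.DataFrame({"City": ["Ithaca", "Holyoke"], "State": ["NY", "CO"]})

# Element-wise string concatenation; brackets are needed to create the column.
ufo["Location"] = ufo["City"] + ", " + ufo["State"]

# An equivalent spelling using the string accessor.
ufo["Location2"] = ufo["City"].str.cat(ufo["State"], sep=", ")
```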


# Why do some Pandas commands end with parentheses while others do not?

In [52]:
movie_rating = pd.read_csv('http://bit.ly/imdbratings')

In [54]:
# head() is a method: it returns the first 5 rows of the DataFrame
movie_rating.head()

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |


In [56]:
# By default, describe() reports descriptive statistics for numeric columns only;
# in this case those are star_rating and duration
movie_rating.describe()

| | star_rating | duration |
|---|---|---|
| count | 979.0 | 979.0 |
| mean | 7.889785 | 120.979571 |
| std | 0.336069 | 26.21801 |
| min | 7.4 | 64.0 |
| 25% | 7.6 | 102.0 |
| 50% | 7.8 | 117.0 |
| 75% | 8.1 | 134.0 |
| max | 9.3 | 242.0 |


In [57]:
movie_rating.shape

(979, 6)

In [58]:
movie_rating.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [59]:
type(movie_rating)

pandas.core.frame.DataFrame

In [65]:
movie_rating.describe(include=['object'])

| | title | content_rating | genre | actors_list |
|---|---|---|---|---|
| count | 979 | 976 | 979 | 979 |
| unique | 975 | 12 | 16 | 969 |
| top | The Girl with the Dragon Tattoo | R | Drama | [u'Daniel Radcliffe', u'Emma Watson', u'Rupert... |
| freq | 2 | 460 | 278 | 6 |
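
The question in this section's heading comes down to the difference between methods and attributes. Methods such as `head()` and `describe()` perform an action, so they are called with parentheses and may take arguments. Attributes such as `shape` and `dtypes` are stored descriptions of the object, so they are read without parentheses. The sketch below illustrates this on a two-row DataFrame typed in from the output above (the variable name `movies` is just a stand-in).

```python
import pandas as pd

# Two rows copied by hand from the movie ratings output above.
movies = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather"],
    "star_rating": [9.3, 9.2],
    "duration": [142, 175],
})

# Attributes: plain values, read without parentheses.
n_rows, n_cols = movies.shape
column_types = movies.dtypes

# Methods: actions, called with parentheses and optional arguments.
first_row = movies.head(1)
summary = movies.describe()

# Leaving the parentheses off a method returns the method object itself
# rather than running it.
head_method = movies.head
```

In a notebook, typing `movies.head` without parentheses prints a "bound method" description instead of rows, which is a common sign that the parentheses were forgotten.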
