# ETL Pipeline Preparation
Follow the instructions below to help you create your ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `messages.csv` into a dataframe and inspect the first few lines.
- Load `categories.csv` into a dataframe and inspect the first few lines.

In [None]:
# import libraries
import pandas as pd
from time import time

import plotly
import plotly.graph_objs as go
import plotly.express as px
import numpy as np

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool
#import re
import udacourse2 #my library!
import math

In [None]:
gen_begin = time()

# load messages dataset
messages = pd.read_csv('messages.csv', index_col='id')
#messages['id'] = messages.index
#messages.info()
#messages.index
messages.head()

In [None]:
# load categories dataset
categories = pd.read_csv('categories.csv', index_col='id')
#categories.info()
categories.head()

### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed
- How to handle duplicated index values: see [here](https://stackoverflow.com/questions/35084071/concat-dataframe-reindexing-only-valid-with-uniquely-valued-index-objects)

In [None]:
#messages.index.duplicated(keep='first')
#drop index duplicated messages
messages = messages.loc[~messages.index.duplicated(keep='first')]

In [None]:
#check number of remaining duplicated messages
print(messages[messages.duplicated()].shape[0])
#drop duplicates
messages = messages.drop_duplicates()
print(messages.shape[0])
# check number of duplicates
messages[messages.duplicated()].shape[0]
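
The two kinds of duplicates handled above (repeated index values and repeated row content) can be told apart on a toy frame. This is only a sketch: the frame `toy` and its values are made up for illustration and are not taken from `messages.csv`.

```python
import pandas as pd

# Toy data (hypothetical): id 2 appears twice in the index,
# and ids 3 and 4 carry identical row content.
toy = pd.DataFrame(
    {"message": ["help", "water", "water again", "food", "food"]},
    index=pd.Index([1, 2, 2, 3, 4], name="id"),
)

# Step 1: keep only the first row for each repeated index value.
unique_index = toy.loc[~toy.index.duplicated(keep="first")]

# Step 2: drop rows whose column values repeat an earlier row
# (drop_duplicates compares columns, not the index).
deduplicated = unique_index.drop_duplicates()

print(len(toy), len(unique_index), len(deduplicated))
```

Running the index check first matters for the later merge, since a left join on a non-unique index multiplies rows.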

### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps

- used an SQL-style left join, with the messages dataframe as the reference
- documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

In [None]:
# merge datasets
df = pd.merge(messages, categories, left_index=True, right_index=True, how='left')
print(df.shape[0])
df.name = 'df'
df.head()

In [None]:
df.info()
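
A left join keeps every message even when no category row matches. A minimal sketch of how to spot such rows, using the `indicator` option of `pd.merge`; the frames `messages_toy` and `categories_toy` are hypothetical stand-ins for the real datasets.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets, both indexed by id.
messages_toy = pd.DataFrame(
    {"message": ["a", "b", "c"]},
    index=pd.Index([1, 2, 3], name="id"),
)
categories_toy = pd.DataFrame(
    {"categories": ["related-1;offer-0", "related-0;offer-0"]},
    index=pd.Index([1, 3], name="id"),
)

# Left join on the index; indicator=True adds a "_merge" column
# saying where each row came from.
merged = pd.merge(
    messages_toy,
    categories_toy,
    left_index=True,
    right_index=True,
    how="left",
    indicator=True,
)

# Messages without any category row are marked "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```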

### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of categories dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.

Solution [here](https://stackoverflow.com/questions/42049147/convert-list-to-pandas-dataframe-column)

In [None]:
# create a dataframe of the 36 individual category columns
jumber = df['categories'].iloc[0]
categories = jumber.split(sep=';')
#{'categories': categories}
categories = pd.DataFrame({'categories': categories})
categories.head()

In [None]:
# select the first row of the categories dataframe
row = categories.iloc[0]
row['categories'][:-2]

[cat[:-2] for cat in categories['categories']]

# use this row to extract a list of new column names for categories.
# one way is to apply a lambda function that takes everything 
# up to the second to last character of each string with slicing
category_colnames = [cat[:-2] for cat in categories['categories']]
print(category_colnames)

# rename the columns of `categories`
#categories.columns = category_colnames
#categories.head()
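
For comparison, the `expand=True` route suggested in the instructions could look like the sketch below. It is an alternative to the cells above, not the approach this notebook takes; the Series `raw` is made-up sample data, and the slicing assumes every value ends in a dash plus one digit.

```python
import pandas as pd

# Made-up sample in the same "name-digit;name-digit" layout as the categories column.
raw = pd.Series(
    ["related-1;request-0;offer-0", "related-0;request-1;offer-0"],
    name="categories",
)

# One column per category value.
split = raw.str.split(";", expand=True)

# Use the first row to name the columns: drop the trailing "-0" / "-1".
split.columns = [value[:-2] for value in split.iloc[0]]

print(split.columns.tolist())
```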

An alert flag, set when a message has no category assigned:

In [None]:
df['if_blank'] = False

In [None]:
#adding new columns with zero value
for colname in category_colnames:
    df[colname] = 0

#df.columns
print('new shape:',df.shape[1])
#df.head(1)

In [None]:
cell = df['categories'].iloc[0]
alfa = set(cell.split(sep=';'))
alfa

for beta in alfa:
    if beta.find('1') != -1:
        print(beta[:-2])

### 4. Convert category values to just numbers 0 or 1.
- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.

In [None]:
begin = time()

filtered_cols = ['categories', 'if_blank', 'related', 'request', 'offer', 'aid_related', 'medical_help', 
       'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 
       'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

df[filtered_cols] = df[filtered_cols].apply(lambda x: udacourse2.fn_test(x, verbose=False), axis=1)

spent = time() - begin
print('elapsed time: {:.1f}s ({}min, {:.4f}sec)'.format(spent, math.trunc((spent)/60), math.fmod(spent, 60)))
#df.head()
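
The row-wise `apply` above depends on `udacourse2.fn_test`, which is not shown here. As a hedged, vectorized sketch of the conversion described in this step (keep the last character, then cast to a number), assuming every value ends in a single `0` or `1`; the frame `split` is made-up data:

```python
import pandas as pd

# Made-up category columns still holding strings such as "related-1".
split = pd.DataFrame(
    {
        "related": ["related-1", "related-0"],
        "request": ["request-0", "request-0"],
    }
)

# For each column: take the last character of every string, then cast to int.
numeric = split.apply(lambda column: column.str[-1].astype(int))

print(numeric)
```

Working on whole columns avoids a Python-level call per row, which is usually where the elapsed time goes.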

### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.

In [None]:
df = df.drop('categories', axis=1)
df.head(1)

### 7. Save the clean dataset into an sqlite database.
You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.

Imports for a drop table function: `import logging`, `from sqlalchemy import MetaData`, `from sqlalchemy import create_engine`, `from sqlalchemy.engine.url import URL`, `from sqlalchemy.ext.declarative import declarative_base`

Better way to do this, [here](https://stackoverflow.com/questions/8645250/how-to-close-sqlalchemy-connection-in-mysql)

- according to the answerer, "Engine is a **factory** for connections as well as a pool of connections, not the connection itself";
- so, to avoid the problem of not being able to close the connection while other members of the pool keep asking for transactions, the answerer recommends using **poolclass=NullPool**;
- as we are not dealing with something that really needs a pool (one transaction at a time is enough for us), let's do it!

In [None]:
database = create_engine('sqlite:///Messages.db', poolclass=NullPool) #, echo=True)
connection = database.connect()

#attempt to save my dataframe to SQLite
try:
    df.to_sql('Messages', database, index=False, if_exists='replace')
except ValueError:
    print('something went wrong while writing data to SQLite')
    
connection.close()
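
A minimal sketch of the engine-as-factory idea with `NullPool`, using context managers so each connection is closed on exit. Plain SQL statements stand in for `df.to_sql` here, and the file name `sketch.db` in a temporary directory is an assumption made only for this example.

```python
import os
import tempfile

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Hypothetical database file in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "sketch.db")

# The engine hands out connections; with NullPool none of them is kept open.
engine = create_engine("sqlite:///" + db_path, poolclass=NullPool)

# engine.begin() opens a connection, commits on success and closes it on exit.
with engine.begin() as connection:
    connection.execute(text("CREATE TABLE Messages (id INTEGER, message TEXT)"))
    connection.execute(text("INSERT INTO Messages VALUES (1, 'help')"))

# A second, independent connection reads the data back.
with engine.connect() as connection:
    rows = connection.execute(text("SELECT id, message FROM Messages")).fetchall()

print(rows)
```

Because the `with` blocks close their connections, no explicit `connection.close()` call is needed.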

In [None]:
spent = time() - gen_begin
print('total elapsed time: {:.1f}s ({}min, {:.4f}sec)'.format(spent, math.trunc((spent)/60), math.fmod(spent, 60)))
df.head()

In [None]:
df.columns

### 8. Use this notebook to complete `etl_pipeline.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom on the `Project Workspace IDE` coming later.

*This was done in a new notebook named `ETL Pipeline Test.py`, and in `etl_pipeline.py`.*

---

## Later tests over the generated dataset

Count the valid records for one column, according to a **criterion**:

- `fn_count_valids` created

In [None]:
df.shape[0]

A test for the function for counting the valid labels for one column

In [None]:
total = df.shape[0]
field = 'if_blank'
count = udacourse2.fn_count_valids(dataset=df, field=field, criteria=True)
percent = 100. * (count / total)
print('{}:{} ({:.1f}%)'.format(field, count, percent))

Turning this into a function

- `fn_valids_report` created

In [None]:
udacourse2.fn_valids_report(dataset=df)
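
The source of `udacourse2.fn_count_valids` is not shown in this notebook. A hypothetical stand-in with the same keyword arguments could be as small as the sketch below; the name `count_valids_sketch`, its default `criteria=1` and the frame `toy` are assumptions for illustration only.

```python
import pandas as pd


def count_valids_sketch(dataset, field, criteria=1):
    """Hypothetical stand-in: count rows where `field` equals `criteria`."""
    return int((dataset[field] == criteria).sum())


# Made-up data with one label column and the alert flag.
toy = pd.DataFrame(
    {
        "food": [1, 0, 1, 1],
        "if_blank": [False, False, True, False],
    }
)

count = count_valids_sketch(toy, "food")
percent = 100.0 * count / len(toy)
print("{}:{} ({:.1f}%)".format("food", count, percent))
```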

---

Labels columns, by **hierarchical structure**

*See the discussion of the label hierarchy further down*

In [None]:
expand_lst = ['related', 'request', 'offer', 'aid_related', 'infrastructure_related', 'weather_related', 
              'direct_report']

aid_lst = ['food', 'shelter', 'water', 'death', 'refugees', 'money', 'security', 'military', 'clothing', 
           'tools', 'missing_people', 'child_alone', 'search_and_rescue', 'medical_help', 'medical_products', 
           'aid_centers', 'other_aid']

weather_lst = ['earthquake', 'storm', 'floods', 'fire', 'cold', 'other_weather']

infrastructure_lst = ['buildings', 'transport', 'hospitals', 'electricity', 'shops', 'other_infrastructure']

To concatenate lists:

In [None]:
a = ['a', 'b']
b = ['c', 'd']
a+b

Inserting the concept into our project:
    
- `fn_count_valids` created

In [None]:
expand_list = expand_lst + aid_lst + weather_lst + infrastructure_lst

verbose = False
total = df.shape[0]
counts = []

for field in expand_list:
    count = udacourse2.fn_count_valids(dataset=df, field=field)
    percent = 100. * (count / total)
    counts.append((count, field, percent))
    if verbose:
        print('{}:{} ({:.1f}%)'.format(field, count, percent))

Sorting tuples: see [here](https://www.pythoncentral.io/how-to-sort-a-list-tuple-or-object-with-sorted-in-python/):

- I will need it to create an **ordered report** about labels

In [None]:
sorted_tuples = sorted(counts, key=udacourse2.fn_getKey, reverse=True)

i=1
c=2
max_c=3

for cat in sorted_tuples:
    count, field, percent = cat
    print('{}-{}:{} ({:.1f}%)'.format(i, field, count, percent))
    if c > max_c:
        break
    else:
        i += 1
        c += 1
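
The same ordered report can be sketched with slicing and `enumerate` instead of manual counters. This assumes the sort key is the count (the first tuple element), which is what `udacourse2.fn_getKey` appears to provide; the `counts` list below is made-up data.

```python
from operator import itemgetter

# Made-up (count, field, percent) tuples in the same layout as `counts` above.
counts = [
    (120, "food", 12.0),
    (300, "related", 30.0),
    (45, "shelter", 4.5),
    (80, "water", 8.0),
]

# Sort by count, largest first, and keep the top three entries.
top = sorted(counts, key=itemgetter(0), reverse=True)[:3]

# enumerate(..., start=1) supplies the ranking number.
lines = [
    "{}-{}:{} ({:.1f}%)".format(rank, field, count, percent)
    for rank, (count, field, percent) in enumerate(top, start=1)
]
print("\n".join(lines))
```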

Turning this into a function:

- `fn_labels_report` created

Generic report:

In [None]:
tuples = udacourse2.fn_labels_report(dataset=df,
                                     data_ret=True,
                                     max_c=11)

Main labels counting:

>- pie at Plotly documentation charts [here](https://plotly.com/python/pie-charts/)
>- using Plotly Express (an easier way to plot graphics)
>- **Pie** charts are good for showing **relative** percentages (how evenly, in general, the labels are distributed across a DataFrame)

In [None]:
#tuples_main['label'][0]

In [None]:
#tuples_main['percentage'][0]

In [None]:
#df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
#df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
#fig = px.pie(df, values='pop', names='country', title='Population of European continent')
#fig.show()

tuples_main = udacourse2.fn_labels_report(dataset=df,
                                          label_filter='main',
                                          data_ret=True,
                                          max_c=False)

fig = px.pie(tuples_main,
             names='label',
             values='percentage',
             title='Main Categories - relative percentages')

fig.show()

Main labels Total:
    
>- new use of Plotly Express, this time for **Bar Charts** [here](https://plotly.com/python/bar-charts/)
>- the **blue** color represents the most dominant category, which is split into many **subcategories**, shown below
>- use of colors for a better representation of graphs [here](https://plotly.com/python/discrete-color/)
>- found a way to export to json [here](https://stackoverflow.com/questions/57769581/save-plot-ly-json-to-a-file)

In [None]:
tuples_main = udacourse2.fn_labels_report(dataset=df,
                                          label_filter='main',
                                          data_ret=True,
                                          max_c=False)
fig = px.bar(tuples_main, 
             x='label', 
             y='percentage',
             title='Main Categories - total percentages',
             color=['#00D', 'goldenrod', 'green', 'red'], 
             color_discrete_map="identity")
fig.show()

On `related`:

In [None]:
tuples_main = udacourse2.fn_labels_report(dataset=df,
                                          label_filter='related',
                                          data_ret=True,
                                          max_c=False)
fig = px.bar(tuples_main, 
             x='label', 
             y='percentage',
             title='Related Subdivisions')
fig.show()

On `aid_related`:

- solution strongly based on code [here](https://stackoverflow.com/questions/47489554/plotly-deactivate-x-axis-sorting)

In [None]:
plotly.offline.init_notebook_mode()
tuples_main = udacourse2.fn_labels_report(dataset=df,
                                         label_filter='related',
                                         data_ret=True,
                                         max_c=False)
data = []

data.append(go.Bar(name='aid',
                   x=tuples_main['label'], 
                   y=tuples_main['percentage']))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title='Related Subcategories')

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='stacked-bar')

In [None]:
plotly.offline.init_notebook_mode()
tuples_main1 = udacourse2.fn_labels_report(dataset=df,
                                           label_filter='aid',
                                           data_ret=True,
                                           max_c=False)

tuples_main1 = tuples_main1.reindex([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,0])
data = []

data.append(go.Bar(name='aid',
                   x=tuples_main1['label'], 
                   y=tuples_main1['percentage']))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title='Aid Related')

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='stacked-bar')

In [None]:
tuples_main2 = udacourse2.fn_labels_report(dataset=df,
                                           label_filter='weather',
                                           data_ret=True,
                                           max_c=False)
tuples_main2 = tuples_main2.reindex([0,1,2,4,5,3])

data = []

data.append(go.Bar(name='weather',
                   x=tuples_main2['label'], 
                   y=tuples_main2['percentage']))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title='Weather Related')

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='stacked-bar')

Infrastructure

>- my old code (don't use it!)
>- it cannot be customized much!

In [None]:
#tuples_main.index = ['buildings', 'transport', 'electricity', 'hospitals', 'shops', 'other_infrastructure']
#tuples_main
#fig = px.bar(tuples_main, x='label', y='percentage')
#fig.show()

In [None]:
plotly.offline.init_notebook_mode()
tuples_main3 = udacourse2.fn_labels_report(dataset=df,
                                           label_filter='infra',
                                           data_ret=True,
                                           max_c=False)
tuples_main3 = tuples_main3.reindex([0,1,3,4,5,2])

data = []

data.append(go.Bar(name='infrastructure',
                   x=tuples_main3['label'], 
                   y=tuples_main3['percentage']))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title="Infrastructure Related")

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='stacked-bar')

Putting them together:

- how to build this combined chart: see the Plotly [documentation](https://plotly.com/python/legend/)

In [None]:
plotly.offline.init_notebook_mode()
data = []

data.append(go.Bar(name='aid',
                   x=tuples_main1['label'], 
                   y=tuples_main1['percentage']))

data.append(go.Bar(name='weather',
                   x=tuples_main2['label'], 
                   y=tuples_main2['percentage']))

data.append(go.Bar(name='infrastructure',
                   x=tuples_main3['label'], 
                   y=tuples_main3['percentage']))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title='Subcategories from Related (aid, weather, infrastructure)')

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='stacked-bar') #, labels={'m', 'd', 'e'})

Related groups, with subgroups shown as colored layers:

>- observe that the **top** element of each bar is the `other` subcategory
>- as `aid_related` has many elements in the `other_aid` subcategory, a massive block appears at the top of this bar

In [None]:
plotly.offline.init_notebook_mode()
#reset each index, so the three frames align row by row when concatenated
tuples_main1 = tuples_main1.reset_index(drop=True)
tuples_main1 = tuples_main1.drop(columns=['label', 'count'])
tuples_main1.columns = ['aid_related']

tuples_main2 = tuples_main2.reset_index(drop=True)
tuples_main2 = tuples_main2.drop(columns=['label', 'count'])
tuples_main2.columns = ['weather_related']

tuples_main3 = tuples_main3.reset_index(drop=True)
tuples_main3 = tuples_main3.drop(columns=['label', 'count'])
tuples_main3.columns = ['infrastructure_related']

tuples_main_tot = pd.concat([tuples_main1, tuples_main2, tuples_main3], axis=1)

data = []
related_list = ['aid_related', 'weather_related', 'infrastructure_related']

for i in range(0,tuples_main_tot.shape[0]):
    data.append(go.Bar(name='related',
                       x=related_list, 
                       y=tuples_main_tot.iloc[i]))

layout = go.Layout(barmode='stack', 
                   xaxis=dict(type='category'),
                   yaxis=dict(title='Percentage by category'),
                   title="Related with Subcategories")

fig = go.Figure(data=data, layout=layout)
fig.update_layout(showlegend=False)
plotly.offline.iplot(fig, filename='stacked-bar')

---

#### Some studies about labels

*OK, I already know that **labels** are not **features**... but we can still make some **criticisms** of the labels that exist for multi-classification*

Strong candidate labels for **ignoring**:

>- **related** (75.0%) $\rightarrow$ too **weighty**, and it looks **meaningless**
>- **child_alone** (empty) $\rightarrow$ **impossible to train**, as it doesn't have any valid element in the dataset
>- **request** (17.1%), **offer** (0.4%), **direct_report** (19.4%) $\rightarrow$ look **meaningless**

---

#### Another viewpoint about these labels

If we look at them more carefully, we can find a curious pattern in them.

These labels look as if they have a kind of hierarchy behind them:

First **hierarchical** class: 

>- **related**
>- **request**
>- **offer**
>- **direct_report**

And then, **related** seems to have a **second** hierarchical class.

Labels to consider for training a classifier on **two layers**, or for **grouping** into main groups, as they are clearly **collinear**:

>- **aid_related** $\rightarrow$ groups calls for aid (things to add or to do **after** the disaster)
>>- **food**
>>- **shelter**
>>- **water**
>>- **death**
>>- **refugees**
>>- **money**
>>- **security**
>>- **military**
>>- **clothing**
>>- **tools**
>>- **missing_people**
>>- **child_alone**
>>- **search_and_rescue**
>>- **medical_help**
>>- **medical_products**
>>- **aid_centers**
>>- **other_aid**
>- **weather_related** $\rightarrow$ groups what was the main **cause** of the disaster
>>- **earthquake**
>>- **storm**
>>- **floods**
>>- **fire**
>>- **cold**
>>- **other_weather**
>- **infrastructure_related** $\rightarrow$ groups **heavy infra** that was probably damaged during the disaster
>>- **buildings**
>>- **transport**
>>- **hospitals**
>>- **electricity**
>>- **shops**
>>- **other_infrastructure**

Let's filter & count for one **subcategory**:

In [None]:
df[df['food'] == 1].shape[0]

Trying a union of two subcategories:

- how to combine multiple filters: see [here](https://stackoverflow.com/questions/13611065/efficient-way-to-apply-multiple-filters-to-pandas-dataframe-or-series)

- first, trying two categories with an **OR** clause

In [None]:
#'|' is the element-wise OR (union); '^' would be XOR and skip rows where both labels are set
df[(df['food'] == 1) | (df['shelter'] == 1)].shape[0]

Making this more automated:

In [None]:
a = "(df['food'] == 1) | (df['shelter'] == 1)"

df[eval(a)].shape[0]

Preparing to turn it into a function

- `fn_cat_condenser` created

In [None]:
cat_aid_related = aid_lst
dataset = 'df'
element = '1'
opperator = '=='
condition = '|'
string = ''

for item in cat_aid_related:
    string = string + "(" + dataset + "['" + item + "'] " + opperator + " " + element + ")" + " " + condition + " "
    
string[:-3]
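
The string-plus-`eval` filter can also be written as a boolean mask over a list of columns, which avoids building code as text. A hedged sketch on made-up data (the frame `toy` and the column list are illustrative); it also shows how `|` (union) and `^` (exclusive or) differ when a row has both labels set.

```python
import pandas as pd

# Made-up labels: the first row has both food and shelter set.
toy = pd.DataFrame({"food": [1, 1, 0, 0], "shelter": [1, 0, 1, 0]})
columns = ["food", "shelter"]

# Union: True when at least one of the listed columns equals 1.
union_mask = toy[columns].eq(1).any(axis=1)

# Exclusive or: True only when exactly one of the two columns equals 1.
exactly_one_mask = (toy["food"] == 1) ^ (toy["shelter"] == 1)

print(union_mask.sum(), exactly_one_mask.sum())
```

The mask form scales to any number of columns, for example the whole `aid_lst`.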

---

### Tests for class `aid_related` 

Counting for the main class **aid_related**:

- you can see that **aid_related** has more rows registered than all its subclasses counted together

- this is not about training a **Machine Learning** model, this is about **database data consistency**

*Think of it this way: if a row is labelled with any subclass of **aid_related**, then it must also be labelled **aid_related**. So, as a next step, we need to correct this by setting **aid_related = 1** for all such rows.*

In [None]:
df_aid_main = df[df['aid_related'] == 1]
df_aid_main.shape[0]

Filtering for **main**, without any **sub-category** registered:

- this is **empty data** as the main category is checked, but the subcategory is not!

In [None]:
#fn_cat_condenser(subset='aid', opperation='main_not_sub')

In [None]:
main_not_sub = df[eval(udacourse2.fn_cat_condenser(subset='aid', 
                                                   name='df',
                                                   opperation='main_not_sub')[0])]
main_not_sub.shape[0]

Testing the function for grouping all **subclasses**

- counting for all subclasses of **aid_related**

- filtering for all **subcategories** with any register:

In [None]:
#fn_cat_condenser(subset='aid', opperation='all_sub')

In [None]:
all_aid_subsets = df[eval(udacourse2.fn_cat_condenser(subset='aid',
                                                      name='df',
                                                      opperation='all_sub')[0])]
all_aid_subsets.shape[0]

Filtering for **all empty** subcategories:

In [None]:
#fn_cat_condenser(subset='aid', opperation='empty_sub')

In [None]:
empty_sub = df[eval(udacourse2.fn_cat_condenser(subset='aid',
                                                name='df',
                                                opperation='empty_sub')[0])]
empty_sub.shape[0]

---

#### Inconsistency detected

1. Consider that our **labels** have a kind of **hierarchical structure**

2. Before running a Machine Learning **Classifier**, we can perform some **mechanical operations** to correct database **data inconsistencies**

3. The theoretical support for these operations comes from **database theory** (e.g. if something is a subclass of another thing, then the class must be set for every row that has at least one item of the subclass set)



Filtering for **empty main** with any **subcategory** registered:

>- this is a **database inconsistency**
>- if a **subgroup** is valid, then the **main** group must be valid too!
>- about database **normal forms**, please read [here](https://www.guru99.com/database-normalization.html)

In [None]:
#fn_cat_condenser(subset='aid', opperation='sub_not_main')

In [None]:
sub_not_main = df[eval(udacourse2.fn_cat_condenser(subset='aid',
                                                   name='df',
                                                   opperation='sub_not_main')[0])]
sub_not_main.shape[0]

Correcting our **data inconsistency**:

>- consider that `opp[0]` is only a **filter** for rows that have any value in the sublabels
>- `df[eval(opp[0])]` gives the dataset rows selected by this filter
>- `df[eval(opp[0])].index` is only the address used to find these rows
>- finally, we need to set `opp[1]`, which is the name of the main column, to the value `1`

In [None]:
opp = udacourse2.fn_cat_condenser(subset='aid',
                                  name='df',
                                  opperation='sub_not_main')

#df[eval(opp[0])]
#df.loc[df[eval(opp[0])].index, opp[1]] = 1 #I don't want to correct it now!

sub_not_main = df[eval(udacourse2.fn_cat_condenser(subset='aid', opperation='sub_not_main')[0])]
sub_not_main.shape[0]
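
The consistency rule (any subcategory set implies the main category set) can be sketched with boolean masks on a toy frame. The data and the two-item `sub_columns` list are assumptions for illustration; the real check in this notebook goes through `udacourse2.fn_cat_condenser`.

```python
import pandas as pd

# Made-up labels: row 1 has subcategories set but aid_related = 0 (inconsistent),
# row 3 has aid_related = 1 with no subcategory (main without sub).
toy = pd.DataFrame(
    {
        "aid_related": [1, 0, 0, 1],
        "food": [1, 1, 0, 0],
        "water": [0, 1, 0, 0],
    }
)
sub_columns = ["food", "water"]

# True when at least one subcategory is set.
any_sub = toy[sub_columns].eq(1).any(axis=1)

sub_not_main = any_sub & toy["aid_related"].eq(0)
main_not_sub = ~any_sub & toy["aid_related"].eq(1)

# Correct the inconsistency on a copy: switch the main label on.
corrected = toy.copy()
corrected.loc[sub_not_main, "aid_related"] = 1

print(int(sub_not_main.sum()), int(main_not_sub.sum()))
```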

Turning this into a function

- function `fn_group_check` created

In [None]:
df_aid = udacourse2.fn_group_check(dataset=df,
                                   subset='aid',
                                   correct=True, 
                                   shrink=True, 
                                   shorten=True, 
                                   verbose=True)
df_aid.head(5)

---

### About rows:
    
- We have 40 columns, and some rows that cannot be used to **train** any model (all their labels are blank)

All the dataframe:

In [None]:
df.shape[0]

Blank labels:
    
- it is not a good idea to train with this data;

- these rows have **zero** classifications (no labels at all!)

In [None]:
df[df['if_blank']].shape[0]

Labels with some content:

In [None]:
df[~df['if_blank']].shape[0]
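
In this notebook `if_blank` is filled in by `udacourse2.fn_test`. Assuming it means "no label column is set", an equivalent flag could be derived directly from the label columns, as in this sketch on made-up data:

```python
import pandas as pd

# Made-up frame with two label columns; rows 1 and 2 carry no label at all.
toy = pd.DataFrame(
    {
        "message": ["a", "b", "c"],
        "related": [1, 0, 0],
        "food": [1, 0, 0],
    }
)
label_columns = ["related", "food"]

# A row is blank when the sum over all its label columns is zero.
toy["if_blank"] = toy[label_columns].sum(axis=1).eq(0)

# Rows that can be used for training.
trainable = toy[~toy["if_blank"]]
print(len(toy), len(trainable))
```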

---

## Test area (can be removed later)

In [None]:
raise Exception('Test area')

In [None]:
full_path = 'c://host/pyprog/project/udacourse/categories.csv'
full_path = 'c://host/pyprog/project/udacourse/messages.csv'
last_one = full_path.rfind('/')
full_path[last_one+1:-4]
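
The same file name extraction can be sketched with `pathlib`, which does not depend on the extension being exactly four characters long. The path is the one used in the cell above.

```python
from pathlib import PurePosixPath

full_path = 'c://host/pyprog/project/udacourse/messages.csv'

# .stem is the last path component without its suffix.
name = PurePosixPath(full_path).stem

# The slicing approach, for comparison.
sliced = full_path[full_path.rfind('/') + 1:-4]

print(name, sliced)
```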

In [None]:
categories.index.name = 'categories'
categories.index.name

In [None]:
uma = 0.70
duas = 0.305
primeira = 0.35 #0.51
segunda = 0.67 #0.74

delta = ((uma - duas) * primeira) + (duas * segunda)
delta

In [None]:
uma = 0.70
duas = 0.305
primeira = 0.70
segunda = 0.88

original = ((uma - duas) * primeira) + (duas * segunda)
original

In [None]:
primeira = 0.51
segunda = 0.74

alfa = ((uma - duas) * primeira) + (duas * segunda)
alfa

In [None]:
def fn_teste(dataset):
    #print(dataset.iloc[0])
    print(locals())

In [None]:
fn_teste(dataset=df)

In [None]:
pd.DataFrame(tuples, columns = ['label', 'count', 'percentage'])

In [None]:
#plotly.plotly was moved out of the plotly package in version 4; plotly.offline renders locally
from plotly.offline import iplot
import plotly.graph_objs as go
import pandas as pd


df = pd.read_csv('C:/Users/Documents/Python/CKANMay.csv')
sd = df.nlargest(3,'Views')
fd = sd.sort_values(by='Views', ascending = False)


my_data = [go.Bar( x = fd.Views, y = fd.Publisher, orientation = 'h')]
my_layout = ({"title": "Most popular publishers",
                       "yaxis": {"title":"Publisher"},
                       "xaxis": {"title":"Views"},
                       "showlegend": False})

fig = go.Figure(data = my_data, layout = my_layout)

iplot(fig)