## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) (Databricks File System) lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.
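
The read cell that follows sets `file_location` to a glob pattern rather than a single file, so several parquet part files are loaded at once. As a rough local illustration of which names such a pattern selects, the sketch below applies the same pattern with Python's `fnmatch` module. The file names in `candidates` are invented for the example, and `fnmatch` is only a stand-in: Spark resolves the pattern itself when loading, and its matching rules may differ in details.

```python
from fnmatch import fnmatchcase

# Same pattern as the file-name part of file_location in the read cell.
pattern = "part_r_*_fbc86a65_*.parquet"

# Hypothetical file names, invented for illustration only.
candidates = [
    "part_r_00000_fbc86a65_aaaa.parquet",
    "part_r_00004_fbc86a65_bbbb.parquet",
    "part_r_00001_deadbeef_cccc.parquet",  # different middle token
    "part_r_00002_fbc86a65_dddd.csv",      # wrong extension
]

# Keep only the names that the wildcard pattern accepts (case-sensitive).
matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```

Only the first two names pass, because each `*` may stand for any run of characters while the fixed parts (`part_r_`, `_fbc86a65_`, `.parquet`) must appear literally.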

In [2]:
# File location and type
# BDM: changed to a part_*_ glob because parquet files 0000-0004 are now loaded; more data should give a better LDA, I hope
file_location = "/FileStore/tables/part_r_*_fbc86a65_*.parquet"
file_type = "parquet"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The options below apply only to CSV files. For other file types (such as parquet here), they are ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)


display(df)

mid,sender,timestamp,omid,subject,body,folder,body_cleaned,concepts
1032,keith.petersen@enron.com,2001-08-03 10:52:10,<2563307.1075860881637.JavaMail.evans@thyme>,RE: Sun Devil open season draft,"Kevin, the document looks good. I did forward to Ranelle Paladino in Steve Kirk s group to look at it from a tariff perspective.Keith -----Original Message-----From: Hyatt, Kevin Sent:	Friday, August 03, 2001 11:23 AMTo:	Petersen, Keith; Winckowski, Michele; Corman, ShelleySubject:	Sun Devil open season draftImportance:	High << File: sundevilopenseason3.doc >> Michele and Keith,	Drew asked that I send this to you to help backstop Rob Kilmer and Teb Lokey. I have incorporated their changes. We re still waiting on final rate numbers to fill in the blanks, but we re close. I d like to get this in the trade press and on TW s website on Monday 8/6. Please call me at 853-5559 if you have any comments or concerns.thanksKevin Hyatt","Kevin_Hyatt_Mar2002Hyatt, KevinProjectsSun Devil","List(kevin, document, look, forward, ranelle, paladino, steve, kirk, look, tariff, perspective, keith, -original, message, -from, hyatt, kevin, sent, friday, august, 2001, 11:23, amto, petersen, keith, winckowski, michele, corman, shelleysubject, sun, devil, season, draftimportance, file, sundevilopenseason3, doc, michele, keith, drew, send, help, backstop, rob, kilmer, teb, lokey, incorporated, change, waiting, final, rate, fill, blank, close, trade, press, website, monday, 8/6, please, call, 853-5559, comment, concern, thankskevin)",List()
1232,info@pmaconference.com,2002-03-06 10:52:18,<3487296.1075860877232.JavaMail.evans@thyme>,Power Executive-- Free Sample Issue,"We welcome you to download a freetrial issue of New Power Executive, the market s leading weekly on market intel, trend analysis, and a host of strategic issues and ideas of direct importance to your planning and business development process. Read by thousands of senior execs around the market, New Power Executive is has been considered a must-read publication since 1997. These four free issues you ll receive over the coming weeks are free and have zero obligation. However, if you choose to formally subscribe during this trial you can save $150 off the regular rate. See the order form on page 5. Enjoy! This email has been sent to kevin.hyatt@enron.com, by PowerMarketers.com.Visit our Subscription Center to edit your interests or unsubscribe.http://ccprod.roving.com/roving/d.jsp?p=oo&m=1000838503237&ea=kevin.hyatt@enron.comView our privacy policy: http://ccprod.roving.com/roving/CCPrivacyPolicy.jspPowered byConstant Contact(R)www.constantcontact.com","Kevin_Hyatt_Mar2002Hyatt, KevinInbox",List(welcome),List()
1432,kevin.hyatt@enron.com,2002-02-25 07:46:38,<8699033.1075860874683.JavaMail.evans@thyme>,Pioneer Gas Pipeline,"Bill, I need a confidentiality agrmt for a proposed pipe sale transaction.We are evaluating possibly selling a lateral piece of TW pipe to Pioneer. They are interested in purchasing approx. 35 miles of pipe that TW took out of service in 2000. The pipe in located in Ward and Pecos County, Texas. Once the C.A. is executed, we would let Pioneer begin due diligence covering information such as easements, maintenance records, physical location, etc.Pioneer is an owner/operator of gas gathering & intrastate pipe transmission assets. Their info is as follows:Pioneer Gas Pipeline, Inc.Mr. Philip R. Allard, President502 S. Koenigheim, Suite 3ASan Angelo, Texas 76903ph 915-655-3300fax 915-655-3315let me know if you need any further info.thxKevin Hyattx35559","Kevin_Hyatt_Mar2002Hyatt, KevinSent Items","List(bill, confidentiality, agrmt, proposed, pipe, sale, transaction, evaluating, possibly, selling, lateral, piece, pipe, pioneer, purchasing, approx, mile, pipe, service, 2000., pipe, located, ward, pecos, county, texas, executed, pioneer, due, diligence, covering, information, easement, maintenance, record, physical, location, etc, pioneer, owner/operator, gas, gathering, intrastate, pipe, transmission, asset, info, follows, pioneer, gas, pipeline, inc., mr., philip, allard, president502, koenigheim, suite, 3asan, angelo, texas, 76903ph, 915-655-3300fax, 915-655-3315let, info, thxkevin)",List()
1632,jeffery.fawcett@enron.com,2001-02-06 20:22:00,<2362670.1075861767847.JavaMail.evans@thyme>,Big Sandy Project,"Kirk,We ve put together a quick estimate of the cost to install the interconnect and measurement facilities off Transwestern. The estimate was based on the original operating conditions of Phase 1 (80 MMcf/d) and Phase 2 (120 MMcf/d) given to us early in discussions with George Briden. The subject facilities will include a 16\"" tie-in to both TW 30\"" mainlines (a redundancy feature, allowing continued use of one line if the other line is temporarily taken out of service for maintenance or is the subject of some type of failure), mainline valves, turbine meter, gallagher flow conditioner, telemetry, flow control, etc. The order of magnitude (+/-30%) estimated cost is $416,700. Not knowing your design and proposed schematic showing joint operations with El Paso and/or Southern Trails, this estimate does not include a pressure control and/or pressure let down equipment. Consequently, the project will have to control pressure in the lateral system. As you know, it s operationally problematic to design for both pressure and flow control into a common header system accepting deliveries from multiple pipelines operating at different pressures. However, Transwestern will be amenable to working with the project and the other interstate pipelines to optimize the ultimate design of all metering and lateral pipeline facilities.Also, the $416K estimate does not include any build up for income tax. If the project reimburses Transwestern for the cost of Transwestern constructing the interconnect and metering station, Transwestern will recognize a taxable event. Use approx. 30% as an effective corporate tax rate for purposes of calculating the price build-up. 
As a means to avoid the tax consequence, Transwestern is agreeable to the idea of the project building the metering station to Transwestern s specifications, with Transwestern operating the facilities under an Operating Agreement. We can discuss these structuring questions as the project develops and we get to the point of executing project documents for the provisioning of service to the project. Our facility planners and I are available to answer questions for you. Please give me a call.Kirk Ketcherside on 02/02/2001 01:14:51 PMTo:	jfawcet@enron.comcc: Subject:	follow-up to our conversationJeff-Nice visiting with you regarding the Big Sandy project.I will look forward to seeing your new interconnect proposal next week.As for my particulars, in addition to my email address herein, pleasesee the following:Kirk KetchersideIGI Resources, Inc.7241 Sanderling CourtCarlsbad, CA 92009760/918-0001 Office phone760/918-0003 Office faxThanks and have nice weekend.Kirk","KHYATT (Non-Privileged)Hyatt, KevinDeleted Items","List(kirk, quick, estimate, cost, install, interconnect, measurement, facility, transwestern, estimate, based, original, operating, condition, phase, mmcf/d, phase, 120, mmcf/d, discussion, george, briden, subject, facility, include, tie-in, mainlines, redundancy, feature, allowing, continued, line, line, temporarily, service, maintenance, subject, type, failure, mainline, valve, turbine, meter, gallagher, flow, conditioner, telemetry, flow, control, etc, magnitude, +/-30, estimated, cost, 416,700, knowing, design, proposed, schematic, joint, operation, paso, and/or, southern, trail, estimate, include, pressure, control, and/or, pressure, equipment, consequently, project, control, pressure, lateral, system, operationally, problematic, design, pressure, flow, control, common, header, system, accepting, delivery, multiple, pipeline, operating, pressure, transwestern, amenable, project, interstate, pipeline, optimize, ultimate, design, metering, lateral, 
pipeline, facility, 416k, estimate, include, build, income, tax, project, reimburses, transwestern, cost, transwestern, constructing, interconnect, metering, station, transwestern, recognize, taxable, event, approx, effective, corporate, tax, rate, purpose, calculating, price, build-up, mean, avoid, tax, consequence, transwestern, agreeable, idea, project, building, metering, station, transwestern, specification, transwestern, operating, facility, operating, agreement, discus, structuring, question, project, develops, executing, project, document, provisioning, service, project, facility, planner, available, answer, question, please, call, kirk, ketcherside, igi, nctimes, net, 02/02/2001, 01:14:51, pmto, jfawcet, enron, comcc, subject, follow-up, conversationjeff-nice, visiting, regarding, sandy, project, look, forward, interconnect, proposal, week, particular, addition, email, address, herein, pleasesee, following, kirk, ketchersideigi, resource, inc., 7241, sanderling, courtcarlsbad, 92009760/918-0001, office, phone760/918-0003, office, faxthanks, nice, weekend)",List()
1832,lmfoust@aol.com,2002-02-25 08:34:45,<32751429.1075860867559.JavaMail.evans@thyme>,Fwd: FW: New Enron Logo....,"Oh goodness.....--------- Inline attachment follows ---------From: To: Wendy Raleigh , Wendy Harshbarger , Wayne Truxillo , Wayne Bockmon , Tracy Peck , Tracy Peck , Susan Lewis , Suren Terzian , Steve Swerdloff , Steve Bandor , Shirleen Glasin , Sheri Johnson , Scott Massey , Rhonda Short , Phyllis Anzalone , Peter Vint , Peter Johnston , Michelle Rob=ichaux , Michelle Foust , Maure=en Palmer , Maureen Craig , M=ark Kennedy , Marcus Dotson , Loui=sa Plotnick , Linda Gee , Kar=en Harrison , Julie Stratton , Judy Dyess , Jonathan Sewell , Jon & Charlotte Whatley , J=ohn Carr , Dick & Jonna Whatley , Diane= Swiber , Deborah Arango ,= Debbie & Robert Moser , Deb Miller , Darlene & Ed Norris , Dale Bartnick , Cris Kinsler , Craig Buehler <=a99ies@aol.com>, Chris Reedy , Bob Clifford , C. J. Kolb , Brad Coleman , Bob Schorr <21stinsight@kingwoodcable.com>, Bob Hurt , Bob Fields , Bill Kyle , Bill Jordan , Bill Heidecker , Ann Bertino , Andy Berdy , Am=y Tyndall Date: Monday, February 25, 2002 2:58:41 GMTSubject:=20","Kevin_Hyatt_Mar2002Hyatt, KevinDeleted Items","List(goodness, inline, attachment, follows, -from, newhatley, earthlink, net, wendy, raleigh, wraleigh267602mi, comcast, net, wendy, harshbarger, edhollyfield, aol, com, wayne, truxillo, wtruxillo, hotmail, com, wayne, bockmon, wbockmon, aol, com, tracy, peck, essay, houston, com, tracy, peck, essay, houston, com, susan, lewis, susanmlewis, prodigy, net, suren, terzian, sterzian, nyc, com, steve, swerdloff, steveswerd, aol, com, steve, bandor, sbandor, darden, com, shirleen, glasin, sglasin, yahoo, com, sheri, johnson, sherijohnson, clearchannel, com, scott, massey, massey, att, net, rhonda, short, fshort, houston, com, phyllis, anzalone, panzalone, houston, com, peter, vint, pvint, houston, com, peter, johnston, johnstonp, houston, com, michelle, robichaux, mrobichaux, houston, com, michelle, 
foust, lmfoust, aol, com, maureen, palmer, rciemian, txucom, net, maureen, craig, maureen, craig, flash, net, mark, kennedy, driscol, 1969, hotmail, com, marcus, dotson, mldot, aol, com, louisa, plotnick, louisa731, aol, com, linda, gee, linda, gee, rweamericas, com, karen, harrison, the-harrisons, houston, com, julie, stratton, djstratton, houston, com, judy, dye, jdyess, houston, com, jonathan, sewell, jonathan, sewell, artemisintl, com, jon, charlotte, whatley, jwhatley, fcgov, com, john, carr, runner3645, aol, com, dick, jonna, whatley, dj2836, aol, com, diane, swiber, djsels001, earthlink, net, deborah, arango, darango, houston, com, debbie, robert, moser, rmoser, houston, com, deb, miller, themiller, charter, net, darlene, norris, danorris, txucom, net, dale, bartnick, dale, bartnick, dot, cris, kinsler, mckinsler, yahoo, com, craig, buehler, a99ies, aol, com, chris, reedy, creedy, msn, com, bob, clifford, bobclifford2000, hotmail, com, kolb, cjk, transera-intl, com, brad, coleman, jcole47128, aol, com, bob, schorr, 21stinsight, kingwoodcable, com, bob, hurt, bobhurt66, hotmail, com, bob, field, bfields1, cfl, com, bill, kyle, bkyle, reliant, com, bill, jordan, wbjordan, reliant, com, bill, heidecker, billheidecker, hotmail, com, ann, bertino, annbertino, aol, com, andy, berdy, aberdy, aol, com, amy, tyndall, atyndall, hotmail, com, date, monday, february, 2002, 2:58:41)",List()
2032,palmannouncements.6bzlwkpg.d@insync-palm.com,2002-01-28 06:33:55,<19719873.1075860862152.JavaMail.evans@thyme>,Have the new Palm i705 in your hands tomorrow!,"When you re done being speechless you can communicate like never before. :::::::::::::::::::::::::::::::::: Dear Jess,This is it! The brand new Palm(tm) i705 wireless handheld. For just $449 plus tax, you can stay in touch, 24/7, with your work needs. As a valued Palm(tm) user, you ll get FREE overnight shipping if you nab yours before 11:59 pm PST on January 31, 2002*.That s right, you can have it in your hands tomorrow! Plus, all orders are covered by our 30-day money back guarantee!http://insync-online.p04.com/u.d?nkeWJwE5ec0rxV=90 So let s get down to the nuts and bolts of this new handsome, super-light handheld. Constantly monitoring the wireless networkto download your messages is just one way your new friend keeps you in touch with your world. Here are a few more: * Stay connected with secure instant access to business email (from Microsoft Outlook on your computer desktop), and personal email such as Earthlink, Mindspring, AOL(r), and Compuserve. * With just a few taps, access your favorite web content using Web Clipping Applications and the MyPalm portal. * Spur of the moment communication with AOL Instant Messenger(sm) service - even if you do not subscribe to AOL. * Inexpensive, unlimited-usage wireless pricing with Palm.Net(tm) service that won t empty your wallet. * Customizable message notification feature (it buzzes, vibrates, flashes, beeps...) All this means you ll be the first to be completely connected-sending and receiving secure email, AOL Instant Messages and getting one-button Internet access.** But the most interesting feature of the new \""always on\"" Palm(tm) i705 handheld is how it automatically downloads your email and AIM messages as they arrive, and lets you know when you ve been contacted. 
Just push the dedicated Email button and your email is already there!** ::::::::::::::::::::::::::::::::::And along with all the great Palm(tm) organization features you re used to, the new Palm(tm) i705 handheld has a bunch of other ones that make life on the go even easier: * The rechargeable lithium polymer battery lasts a full week with normal use before it needs recharging. * Ready to operate out-of-the-box means no need for complicated hardware modems, plug-ins or extra add-ons. * The SD/MMC expansion card slot lets you add memory, applications, games and even content like your favorite book. * Over $100 of free software lets you view and edit Microsoft Word, Excel and PowerPoint documents, view photographs and even play video clips. If you d like to get a closer look at this remarkable new Palm(tm)handheld, just click below for a demo. http://insync-online.p04.com/u.d?zEeWJwE5ec0rwi=180Or, if you re sold already (and who could blame you?), just click below or call 1-800-881-7256 to buy one for just $449, with a full 30-day money back guarantee.http://insync-online.p04.com/u.d?zEeWJwE5ec0rxP=110Plus, as a valued Palm(tm) user, you ll get free overnight shipping when you enter Promo Code FRSHP705 and order before 11:59 pm PST on January 31, 2002*. All in all, now s the perfect time to get connected! The Palm Team *To guarantee free overnight shipping, you must use the above Promo Code when placing your order. Orders placed prior to 11:00 a.m. PST are guaranteed to ship to arrive the next business day. Free overnight shipping offer valid only in the U.S. and while supplies last. **Internet and email access requires subscription to the Palm.net wireless service sold separately. Coverage not available in all areas.~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~To unsubscribe from future email communications regarding Palm s upcoming product announcement, click below, or simply reply to this message with \""unsubscribe\"" as the subject line of the message. 
You may still receive other communications from Palm which you have previously requested. http://insync-palm.com/0019_unsub.dyn?i=836528970&s=ZVae(C)2002 Palm, Inc. All rights reserved. http://insync-online.p04.com/u.d?sEeWJwE5ec0rw5=130Palm.com http://insync-online.p04.com/u.d?qkeWJwE5ec0rw-=140 Palm Store http://insync-online.p04.com/u.d?kEeWJwE5ec0rwz=150InSync Online http://insync-palm.com/insync.dyn?i=836528970&s=ZVae","Kevin_Hyatt_Mar2002Hyatt, KevinDeleted Items","List(speechless, communicate, dear, jess, brand, palm, i705, wireless, handheld, 449, plus, tax, stay, touch, 24/7, valued, palm, user, free, overnight, shipping, nab, 11:59, pst, january, 2002*, hand, tomorrow, plus, covered, 30-day, money)",List()
2232,announcements.egs@enron.com,2002-03-22 11:32:05,<3856561.1075860857105.JavaMail.evans@thyme>,Steve Cooper voicemail - 03/22/2002,"Today, Friday, March 22, Steve Cooper left the following voicemail for employees. Since some employees do not have access to voicemail, we are providing the following transcript of that message. Please note that the retention and severance plan mentioned in the transcript is for the debtor companies of Enron.You can also access this message and past messages at: http://home.enron.com/updates/.***********************************************************************************Steve CooperVoicemail MessageDistributed Friday, March 22, 2002Hi everybody, it s Steve Cooper. It s Thursday evening, March 21. I m sorry I haven t gotten to you earlier this week, but we ve been finalizing and dotting the I s and crossing the T s on the retention and severance plan. It s that I want to update you on.We ve continued to present and represent the plan to the Creditors Committee, and now they are comfortable with it and in support of it, which is all good news. In order to be able to motion this up to the court, we are in the process of finalizing all the various documentation necessary -- the plan itself, the motions, the applications so on and so forth. I am relatively confident that we will have this motioned up next week sometime. This has gotten pushed back a number of times, but I think we now have it on track and hopefully next week will be the week.I do apologize for the way this thing has gotten delayed. Now that I have worked with a number of you over the last month or two, I see how quickly you move and how quickly you are used to moving - so, I ve got to believe that this has been incredibility frustrating for all of you. Once we get this motioned up, there is a notice and objection period before the court order is approved and the plan goes active. I can t guarantee you that it s going to go active. 
But, I m pretty confident - in fact, I m highly confident - that with the endorsement of the Creditors Committee, the court will approve it.However, we are not going to wait until we get approval to share the details of the plan with you. We will put a process in place so that over the next week or so we will hold one-on-one meetings with all of you to discuss the individual participation and what these plans are going to mean to each and every one of you to eliminate to the greatest extent possible the uncertainty that everyone has been living with over the last several months.The plan is retroactive to March 1, regardless of when the court approves it, so I want everybody to know that will be the start date.On another subject, the merit increases and promotions for employees in the debtor companies have been finalized and communicated, with the possible exception of some MD s and VP s. Hopefully, you have all been notified about those promotions.On the merit increase side, I believe they have all been finalized for all employees below the vice president level. Hopefully, by now you have received information as to what those increases are. If you haven t, I would encourage you to speak to whoever you report to.Everything else is moving forward. I think we now have the employees meeting scheduled for April 11, and I m looking forward to seeing and meeting more of you at that get together. We are on target for pulling together a longer-term business plan for the company, and hopefully will be able to describe it to you in some detail on April 11. We re exchanging information and communicating with the creditors on a more regular and steady basis. Those relationships are beginning to settle down. I think there is actually a lot of good news that I can impart to you over the next several weeks.I want to thank everybody for your continued efforts and understanding. 
Again, I want to apologize for the way this thing has dragged, but I can assure you that Jeff McMahon, the Human Resource people, and myself have all been working like busy bees to get this program finalized and in place. Knock on wood next week is the week.Everybody have a great evening. If you have any concerns, issues, or problems, please don t hesitate to give me a buzz or an Email and I will get back to each and every one of you as soon as I can. Have a great night and thanks again for all your support, help and dedication to the ongoing Enron organization.","Kevin_Hyatt_Mar2002Hyatt, KevinCorp Memos","List(friday, march, steve, cooper, left, following, voicemail, employee, employee, access, voicemail, providing, following, transcript, message, please, note, retention, severance, plan, mentioned, transcript, debtor, company, enron, access, message, past)",List()
2432,sdimitroff@kslaw.com,2001-09-26 16:09:52,<17027039.1075860852222.JavaMail.evans@thyme>,FW: US response,"Sashe D. DimitroffKing & Spalding1100 Louisiana, Ste. 3300Houston, Texas 77002(713) 751-3229 (direct)(713) 751-3290 (fax)sdimitroff@kslaw.com> -----Original Message-----> From:	Correll, Charles > Sent:	Wednesday, September 26, 2001 6:08 PM> To:	Dimitroff, Sashe> Subject:	FW: US response> > > > -----Original Message-----> From:	jfarley@reliant.com [mailto:jfarley@reliant.com]> Sent:	Wednesday, September 26, 2001 12:35 PM> To:	pierce_scott@jpmorgan.com; Anthony R. Ierardi; tjl334@yahoo.com;> indysq5@yahoo.com; jim_wilon@hotmail.com; Correll, Charles;> mclaugh4@mit.edu> Subject:	US response> > <> Need to add a Cav Regiment to the attached picture...> > > (See attached file: WonTheTo.jpg) Confidentiality NoticeThis message is being sent by or on behalf of a lawyer. It is intended exclusively for the individual or entity to which it is addressed. This communication may contain information that is proprietary, privileged or confidential or otherwise legally exempt from disclosure. If you are not the named addressee, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. 
If you have received this message in error, please notify the sender immediately by e-mail and delete all copies of the message.","Kevin_Hyatt_Mar2002Hyatt, KevinPersonal","List(sashe, dimitroffking, spalding1100, louisiana, ste, 3300houston, texas, 77002, 713, 751-3229, direct, 713, 751-3290, fax, sdimitroff, kslaw, com, -original, message, correll, charles, sent, wednesday, september, 2001, 6:08, dimitroff, sashe, subject, response, -original, message, jfarley, reliant, com, mailto, jfarley, reliant, com, sent, wednesday, september, 2001, 12:35, pierce, scott, jpmorgan, com, anthony, ierardi, tjl334, yahoo, com, indysq5, yahoo, com, jim, wilon, hotmail, com, correll, charles, mclaugh4, mit, edu, subject, response, wontheto, jpg, add, cav, regiment, attached, picture, attached, file, wontheto, jpg, confidentiality, noticethis, message, sent, behalf, lawyer, intended, exclusively, individual, entity, addressed, communication, contain, information, proprietary, privileged, confidential, otherwise, legally, exempt, disclosure, named, addressee, authorized, read, print, retain, copy, disseminate, message, received, message, error, please, notify, sender, immediately, e-mail, delete_copy_message)",List(delete_copy_message)
3032,james.steffes@enron.com,2001-03-15 00:17:00,<14435011.1075860464403.JavaMail.evans@thyme>,Re: FERC Jurisdiction Over California Investigations,"Mary --I don t follow the question of RFP and the attached memo.Jim	Mary Hain@ECT	03/14/2001 03:45 PM To: \""Ronald Carroll\"" @ ENRON cc: James D Steffes/NA/Enron@Enron, Joe Hartsoe@Enron Subject: Re: FERC Jurisdiction Over California InvestigationsIt s been a few years since I reviewed the case law on this. Perhaps you should write a memo reviewing the FERC cases on this issue and if necessary, go talk to FERC about whether an RFP will be sufficient to set a just and reasonable rate. Is that okay with you Jim? Ron - how much would that cost? Enron Capital & Trade Resources Corp. From: \""Ronald Carroll\"" 03/14/2001 01:10 PM	To: \""Jeffrey Watkiss\"" , , , , cc: , Subject: FERC Jurisdiction Over California InvestigationsRichard: In connection with EPMI s contention in the various California litigations that they should be dismissed due to FERC s primary jurisdiction, it strikes me that it may be helpful to lodge FERC s March 9, 2001 order with the Court. While FERC, in the 12/15 order, established its investigation, the March 9 order makes findings and imposes remedies (fortunately not against us). This should enhance the primary jurisdiction argument. FERC s intent to occupy the field could not be more clear. 
Ron",Mary_Hain_Aug2000_Jul2001Notes FoldersNotes inbox,"List(mary, don, follow, question, rfp, attached, memo, jim, mary, hain, ect, 03/14/2001, 03:45, ronald, carroll, rcarroll, bracepatt, com, enron, james, steffes/na/enron, enron, joe, hartsoe, enron, subject, ferc, jurisdiction, california, investigationsit, reviewed, law, write, memo, reviewing, ferc, issue, talk, ferc, rfp, sufficient, set, reasonable, rate, okay, jim, ron, cost, enron_capital, trade_resource, corp., ronald, carroll, rcarroll, bracepatt, com, 03/14/2001, 01:10, jeffrey, watkiss, dwatkiss, bracepatt, com, gfergus, brobeck, com, jsteffe, enron, com, rsanders, enron, com, sbishop, gibbs-bruns, com, mary, hain, enron, com, smara, enron, com, subject, ferc, jurisdiction, california, investigationsrichard, connection, epmi, contention, various, california, litigation, dismissed, due, ferc, primary, jurisdiction, strike, helpful, lodge, ferc, march, 2001, court, ferc, 12/15, established, investigation, march, make, finding, imposes, remedy, fortunately, enhance, primary, jurisdiction, argument, ferc, intent, occupy, field)","List(enron_capital, trade_resource)"
3232,susan.mara@enron.com,2001-03-21 08:12:00,<26176944.1075860455989.JavaMail.evans@thyme>,State Controller Kathleen Connell Press Conference,"A report from ARM s PR firm to the ARM members.Sue MaraEnron Corp.Tel: (415) 782-7802Fax:(415) 782-7854----- Forwarded by Susan J Mara/NA/Enron on 03/21/2001 04:12 PM -----	\""Beiser, Megan\"" 03/21/2001 03:00 PM To: \""Aaron Thomas (E-mail) (E-mail)\"" , \""Andrea Weller (E-mail) (E-mail)\"" , \""andrew Chau (E-mail) (E-mail)\"" , \""Bill Chen (E-mail) (E-mail)\"" , \""Douglas Oglesby (E-mail) (E-mail)\"" , \""Fairchild, Tracy\"" , \""Jeffrey Hanson (E-mail) (E-mail)\"" , \""jennifer Chamberlin (E-mail) (E-mail)\"" , \""john Barthrop (E-mail) (E-mail)\"" , \""John Leslie (E-mail) (E-mail)\"" , \""Joseph Alamo (E-mail) (E-mail)\"" , \""Manuel, Erica\"" , \"" Michael Nelson (E-mail)\"" , \""Peter Bray (E-mail) (E-mail)\"" , \""Rebecca Schlanert (E-mail) (E-mail)\"" , \""Richard Counihan (E-mail) (E-mail)\"" , \"" Robert Morgan (E-mail)\"" , \""Sue Mara (E-mail) (E-mail)\"" , \""Allen, Stevan\"" , arm@phaser.com, \""brbarkovich@earthlink.net\"" , cra@calretailers.com, dennis.flatt@kp.org, dhunter@smithandkempton.com, djsmith@smithandkempton.com, Dominic.DiMare@calchamber.com, drothrock@cmta.net, gharrison@calstate.edu, hgovenar@govadv.com, jackson_gualco@gualcogroup.com, ken_pietrelli@ocli.com, kgough@calpine.com, kmccrea@sablaw.com, kmills@cfbf.com, lhastings@cagrocers.com, mday@gmssr.com, mmoretti@calhealth.org, nplotkin@tfglobby.com, randy_britt@robinsonsmay.com, richard.seguin@kp.org, RochmanM@spurr.org, rrichter@calhealth.org, sgovenar@govadv.com, smccubbi@enron.com, spahnn@hnks.com, theo@ppallc.com, vincent.stewart@ucop.edu, vjw@ceert.org, \""Warner, Jami\"" , wbooth@booth-law.com, wbrown@lhom.com, wlarson@calstate.edu cc: Subject: State Controller Kathleen Connell Press Conference> State Controller Kathleen Connell held a press conference today to voice> her concerns over electricity power purchases effect on the State 
s> general fund and in response to a letter received March 12 from Department> of Finance Director Tim Gage and Chief Legislative Analyst Elizabeth Hill> requesting a transfer of money from the General Fund.> > Connell stated that Current financial information related to the purchase> of electricity and the general fund is as follows:> *State s general fund surplus has dropped from $8.5 billion in January to> a current estimated level of $3.2 billion> *Receipt of a letter from the Department of Finance Director and the Chief> Legislative Analyst requesting an additional transfer of $5.6 billion to> the Special Fund for Economic Uncertainties to cover power purchases> *To cover this transfer, Connell said the state would have to borrow $2.4> billion> *Connell: \""We started this year with a generous budget surplus; the energy> crisis has taken much of that away, and this transfer on top of the> electricity purchases would put the [General] fund at risk.\""> *Debt issuance has not occurred to reimburse the General Fund for power> purchases, while disbursements from the General Fund increase daily> > Controller Connell sent a letter to the Governor today, calling for the> following steps to be taken:> 1. DWR to notify the Controller s office of any purchases made and any> contracts negotiated to date> 2. DWR to notify the Controller s office of any future purchases and> contracts within 7 days, regardless of when the invoices are submitted> 3. Information on purchases in excess of $55 million should be submitted> within 24 hours> 4. DWR should prepare new General Fund cash flow estimates for the next 30> and 60 days, and for the end of the fiscal year> 5. DWR should take action to ensure that bond sales are completed by the> end of May 2001> > The Controller is also ordering an audit of the DWR resources to determine> the amount of money being spent by the Department. 
Currently, Connell> said, \""I have to rely on press reports as valid to determine the amount of> money spent in power purchases.\"" She said she needs \""acknowledgement of> the total amount of liabilities made\"" by DWR. > > Megan Beiser> Assistant Account Executive> Edelman Public Relations Worldwide, Sacramento> Phone: (916) 442-2331> Fax: (916) 447-8509>",Mary_Hain_Aug2000_Jul2001Notes FoldersNotes inbox,"List(report, arm, firm, arm, sue, maraenron, corp., tel, 415, 782-7802fax, 415, 782-7854, forwarded, susan, mara/na/enron, 03/21/2001, 04:12, beiser, megan, megan, beiser, edelman, com, 03/21/2001, 03:00, aaron, thomas, e-mail, e-mail, athomas, newenergy, com, andrea, weller, e-mail, e-mail, aweller, sel, com, andrew, chau, e-mail, e-mail, anchau, shellus, com, bill, chen, e-mail, e-mail, bchen, newenergy, com, douglas, oglesby, e-mail, e-mail, doao, chevron, com, fairchild, tracy, tracy, fairchild, edelman, com, jeffrey, hanson, e-mail, e-mail, jeff, hanson, phaser, com, jennifer, chamberlin, e-mail, e-mail, jnnc, chevron, com, john, barthrop, e-mail, e-mail, jbarthrop, electric, com, john, leslie, e-mail, e-mail, jleslie, luce, com, joseph, alamo, e-mail, e-mail, jalamo, enron, com, manuel, erica, erica, manuel, edelman, com, michael, nelson, e-mail, mnelson, electric, com, peter, bray, e-mail, e-mail, pbray, newpower, com, rebecca, schlanert, e-mail, e-mail, rschlanert, electric, com, richard, counihan, e-mail, e-mail, rick, counihan, greenmountain, com, robert, morgan, e-mail, rmorgan, newenergy, com, sue, mara, e-mail, e-mail, smara, enron, com, allen, stevan, stevan, allen, edelman, com, arm, phaser, com, brbarkovich, earthlink, net, bbarkovich, earthlink, net, cra, calretailers, com, dennis, flatt, org, dhunter, smithandkempton, com, djsmith, smithandkempton, com, dominic, dimare, calchamber, com, drothrock, cmta, net, gharrison, calstate, edu, hgovenar, govadv, com, jackson, gualco, gualcogroup, com, ken, pietrelli, ocli, com, kgough, calpine, com, 
kmccrea, sablaw, com, kmills, cfbf, com, lhastings, cagrocers, com, mday, gmssr, com, mmoretti, calhealth, org, nplotkin, tfglobby, com, randy, britt, robinsonsmay, com, richard, seguin, org, rochmanm, spurr, org, rrichter, calhealth, org, sgovenar, govadv, com, smccubbi, enron, com, spahnn, hnks, com, theo, ppallc, com, vincent, stewart, ucop, edu, vjw, ceert, org, warner, jami, jami, warner, edelman, com, wbooth, booth-law, com, wbrown, lhom, com, wlarson, calstate, edu, subject, controller, kathleen, connell, press_conference, controller, kathleen, connell, held, press_conference, voice, concern, electricity, power, purchase, effect, fund, response, letter, received, march, department, finance, director, tim, gage, chief, legislative, analyst, elizabeth, hill, requesting, transfer, money_fund, connell, stated, current, financial, information, related, purchase, electricity, fund, follows, *state, fund, surplus, dropped, billion, january, current, estimated, level, billion, *receipt, letter, department, finance, director, chief, legislative, analyst, requesting, additional, transfer, billion, special, fund, economic, uncertainty, cover, power, purchase, *to, cover, transfer, connell, borrow, billion, *connell, started, generous, budget, surplus, energy, crisis, transfer, top, electricity, purchase, fund, risk, *debt, issuance, occurred, reimburse, fund, power, purchase, disbursement, fund, increase, daily, controller, connell, sent, letter, governor, calling, following, step, dwr, notify, controller, office, purchase, contract, negotiated, date, dwr, notify, controller, office, future, purchase, contract, day, regardless, invoice, submitted, information, purchase, excess, million, submitted, hour, dwr, prepare, fund, cash_flow, estimate, day, fiscal, dwr, action, ensure, bond, sale, completed, 2001, controller, audit, dwr, resource, determine, amount, money, spent, department, currently, connell, rely, press, report, valid, determine, amount, money, spent, power, 
purchases., acknowledgement, total, amount, liability, dwr, megan, beiser, assistant, account, executive, edelman, public, relation, worldwide, sacramento, phone, 916, 442-2331, fax, 916)","List(press_conference, money_fund, cash_flow)"

Hub,High,Low,Wtd Avg Index,Change ($),Vol (Mwh)
Cinergy,$21.75,$20.75,$21.14,+ .52,43200
Comed,$20.25,$20.25,$20.25,- .37,"3,200"
Entergy,$23.25,$22.00,$22.44,+ .80,32000
Nepool,$28.50,$27.00,$27.86,+ .09,"8,800"
Palo Verde,$25.00,$22.50,$23.42,- 3.95,4400
PJM-West,$24.60,$24.20,$24.39,- .79,75200
SP-15,$25.00,$24.00,$24.49,- 3.62,"6,000"
TVA,$22.00,$21.00,$21.69,+ .94,6400


In [3]:
# Create a view or table
# Lower-case the subject text and truncate the timestamp to a date string ('yyyy-MM-dd'),
# dropping the hours and minutes. Losing hh:mm could be an issue, but it was the only way
# to get the generic filter_my_dataframe function working to select emails by date.
# May have to revisit.
# 
from pyspark.sql.functions import lower, col, date_format
#from pyspark.sql.functions import *
# 
df = df.select("mid","sender", date_format(col("timestamp"),'yyyy-MM-dd').alias("timestamp"), \
               "omid",lower(col('subject')).alias('subject'), "body", "folder","body_cleaned", "concepts")

temp_table_name = 'df'
df.createOrReplaceTempView(temp_table_name)
#display(df)
df.count()


In [4]:
%sql

/* Query the created temp table in a SQL cell  Left here as an example of code syntax */

select count(*) from `df`

count(1)
6306


In [5]:
# df1 and df2 are temporary dataframes holding the replies ("re:") and forwards ("fw:"),
# since it seemed worthwhile to build a dataframe that excludes those emails.
# no_for_or_replies_df is the final dataframe with the forwards and replies removed.
df1 = df.filter(df['subject'].startswith("re:"))
df2 = df.filter(df['subject'].startswith("fw:"))
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

no_for_or_replies_df = spark.sql("select * from df minus (select * from df1 union select * from df2)")
display(no_for_or_replies_df)
no_for_or_replies_df.createOrReplaceTempView('no_for_or_replies_df')


In [6]:
%sql


select count(*) from no_for_or_replies_df

count(1)
3922


In [7]:
# Create only_bigrams_df from no_for_or_replies_df above by keeping only the emails
# that have at least one bigram in the concepts column
import pyspark.sql.functions as F
only_bigrams_df=spark.sql("SELECT * from no_for_or_replies_df").filter(F.size('concepts') > 0)
only_bigrams_df.createOrReplaceTempView('only_bigrams_df')  

In [8]:
%sql
select count(*) from only_bigrams_df

count(1)
1250


In [9]:
# The clean function below strips text from the Enron email data that is not useful for summarization -- thanks Marta
# I attempted to remove all punctuation except for periods but it wasn't working -- Revisit
# We should all add to the clean function as the shared place to pre-process data
import re, html, string
rem = ['(?s)<TYPE>GRAPHIC.*?</TEXT>',
'(?s)<TYPE>EXCEL.*?</TEXT> ',
'(?s)<TYPE>PDF.*?</TEXT>',
'(?s)<TYPE>ZIP.*?</TEXT>',
'(?s)<TYPE>COVER.*?</TEXT>',
'(?s)<TYPE>CORRESP.*?</TEXT>',
'(?s)<TYPE>EX-10[01].INS.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].INS.*?</TEXT>',
'(?s)<TYPE>EX-10[01].SCH.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].SCH.*?</TEXT>',
'(?s)<TYPE>EX-10[01].CAL.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].CAL.*?</TEXT>',
'(?s)<TYPE>EX-10[01].DEF.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].LAB.*?</TEXT>',
'(?s)<TYPE>EX-10[01].LAB.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].LAB.*?</TEXT>',
'(?s)<TYPE>EX-10[01].PRE.*?</TEXT>',
'(?s)<TYPE>EX-99.SDR [KL].PRE.*?</TEXT>',
'(?s)<TYPE>EX-10[01].REF.*?</TEXT>',
'(?s)<TYPE>XML.*?</TEXT>',
'<TYPE>.*',
'<SEQUENCE>.*',
'<FILENAME>.*',
'<DESCRIPTION>.*',
'(?s)(?i)<Head>.*?</Head>',
'(?s)(?i)<Table.*?</Table>',
'(?s)<[^>]*>']
#
def clean(txt):
#  txt=txt.replace (".",". ")
  doc = re.sub("\xa0|\n|\t|—|_"," ",html.unescape(txt))
  remove = string.punctuation
  remove = remove.replace(".", "") # don't remove periods
  pattern = r"[{}]".format(remove) # create the pattern
#  bdm_doc=re.sub(pattern, "", txt) 
#  return re.sub("(?s) +"," ",re.sub(rem[-1]," ",bdm_doc))
  return re.sub("(?s) +"," ",re.sub(rem[-1]," ",doc))
#
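def add_space(s):
    # Put a space after every period, collapse runs of whitespace, and trim the end
    res = re.sub(r'\s+$', '', re.sub(r'\s+', ' ', re.sub(r'\.', '. ', s)))
    # Guard against empty input before checking the last character
    if res and res[-1] != '.':
       res += '.'
    return res
def remove_punct(s):
  # Strip punctuation and digits (string.punctuation already includes '[' and ']')
  remove=string.punctuation+'0123456789'
  return s.translate(str.maketrans('', '', remove))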
#def remove_punct(s):
#  remove = string.punctuation
#  pattern = r"[{}]".format(remove) # create the pattern
#  return re.sub(pattern, "", s) 
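One likely reason the commented-out punctuation pattern also removed periods: once `.` is taken out of `string.punctuation`, the characters `,-/` sit next to each other, and inside a regex character class that reads as the range from `,` to `/`, which includes `.`. Below is a minimal sketch of a fix, assuming the goal is to drop every ASCII punctuation character except periods. `strip_punct_keep_periods` is a hypothetical helper name, not part of the notebook.

```python
import re
import string

def strip_punct_keep_periods(text):
    # Hypothetical helper: remove all ASCII punctuation except '.'
    remove = string.punctuation.replace(".", "")
    # re.escape keeps '-', ']', '^' and '\' literal inside the character class,
    # so ',-/' is no longer interpreted as a range that covers '.'
    pattern = "[{}]".format(re.escape(remove))
    return re.sub(pattern, "", text)

print(strip_punct_keep_periods("Hello, world! It's 3.5 - ok."))
```

Removed characters are deleted rather than replaced by a space, so words joined only by punctuation will run together; substituting `" "` instead of `""` is an easy variation.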

In [10]:
#Generic function to filter a dataframe based on a dictionary
#of key-value pairs given as {column_name:column_value}
#This function also requires that the name of the dataframe is passed in as well
#Could not get sql to be truly dynamic due to the sql with argument requiring a select_string.format syntax, as the .format part
#is not allowed to be a string (looked up eval, exec, set_attrib, etc.; none of these would work).
#This function requires the dataframe to be registered as a temporary table and can only have a max of 3 items in the dictionary, so
#only 3 values can be chained and filtered in the where clause.
 
def filter_my_dataframe (my_dict, my_df):
#
  my_counter=0
  my_query='select * from ' + my_df + ' where '
#
# Build the select and where clause string
#
  for my_key in my_dict.keys():
    my_counter+=1
    if my_counter==1:
      my_query=my_query+my_key+'="{}"'
    else:
      my_query=my_query+' and '+my_key+'="{}"'
#  
  format_str='.format('
  my_counter=0
#
# Determine the number of passed values -- max is 3
#
  for my_value in my_dict.values():
    my_counter+=1
    if my_counter==1:
      my_first_val=my_value
    else:
      if my_counter==2:
        my_second_val=my_value
      else:
        if my_counter==3:
          my_third_val=my_value
#
# Build the format that requires appending to the select/where string as 'sql_string'.format
#
  if my_counter==1:
    my_query=my_query.format(my_first_val)
  if my_counter==2:
    my_query=my_query.format(my_first_val, my_second_val)
  if my_counter==3:
    my_query=my_query.format(my_first_val,my_second_val, my_third_val)
#
  return (sqlContext.sql(my_query))
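The three-value limit comes from formatting the query in a separate step. A minimal sketch of one way around it, assuming the same `column="value"` equality style as above: build each clause directly and join them with `and`. `build_filter_query` is a hypothetical helper; it only produces the query string, which could then be passed to `spark.sql`.

```python
def build_filter_query(view_name, conditions):
    # Hypothetical helper: returns a SQL string for any number of equality filters
    if not conditions:
        return 'select * from {}'.format(view_name)
    # One 'column="value"' clause per dictionary entry, joined with AND
    clauses = ['{}="{}"'.format(column, value) for column, value in conditions.items()]
    return 'select * from {} where {}'.format(view_name, ' and '.join(clauses))

query = build_filter_query('df', {'mid': 68633, 'timestamp': '2001-08-03'})
print(query)
```

Values are interpolated into the string as-is, so this sketch is only suitable for trusted input typed into the notebook.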

In [11]:
# The nltk_text_summarization function is adapted from the code in this article: https://stackabuse.com/text-summarization-with-nltk-in-python/
# Given the text of one email, it returns the n sentences that score highest when weighted by how frequently their words are used in that email
#
import nltk
import heapq
from nltk.tokenize import sent_tokenize,word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
def nltk_text_summarization(my_email, my_num_of_summary_sentences):
  try:
#
# Use nltk to tokenize the sentences. Initialize the stop words and dictionary for word_frequencies
#
    sentence_list = nltk.sent_tokenize(my_email)
    stopwords = nltk.corpus.stopwords.words('english')
    word_frequencies = {}  
# 
# Load the words from email into dictionary and initialize to 1 and keep running count as value
#
    for word in nltk.word_tokenize(my_email):
      if word not in stopwords:
        if word not in word_frequencies.keys():
          word_frequencies[word] = 1
        else:
          word_frequencies[word] += 1
#
# weight the word counts by dividing by the maximum word count found in the email
#
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
      word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
#
# Find the sentences and add word frequency weights to the sentences in a dictionary structure.
# This will identify the sentences that have the highest word weights that will be used for the text summary
#
    sentence_scores = {}
    for sent in sentence_list:
      for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
          if len(sent.split(' ')) < 30:
            if sent not in sentence_scores.keys():
              sentence_scores[sent] = word_frequencies[word]
            else:
              sentence_scores[sent] += word_frequencies[word]
# 
# Store the number of scored sentences in the email
# If there are fewer scored sentences than requested, use all of them for the text summary
# Else capture the top N sentences specified in the call to function to be used for the text summary
#
    the_length=len(sentence_scores)
    if the_length < my_num_of_summary_sentences:
      summary_sentences = heapq.nlargest(the_length, sentence_scores, key=sentence_scores.get)
    else:
      summary_sentences = heapq.nlargest(my_num_of_summary_sentences, sentence_scores, key=sentence_scores.get)
    my_summary = ' '.join(summary_sentences)
    return(my_summary)
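  except Exception:
    # Summarization failed (e.g. no usable words in the email); return None so the row can be filtered out
    return None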
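The scoring idea in `nltk_text_summarization` can be seen more easily without the NLTK dependency. The block below is a minimal sketch of the same technique (normalized word frequencies summed per sentence, top n sentences kept), under loud assumptions: sentences are split with a simple regex instead of `sent_tokenize`, the stop-word set is a tiny hypothetical list rather than NLTK's English list, and all words are lower-cased before counting.

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's list)
STOPWORDS = {"the", "a", "is", "to", "of", "and", "in", "on", "it", "at"}

def summarize_by_frequency(text, n_sentences, max_words=30):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]
    if not words:
        return ""
    # Weight each word by its count divided by the highest count in the text
    counts = Counter(words)
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # Score each short-enough sentence by summing the weights of its words
    scores = {}
    for sent in sentences:
        tokens = re.findall(r"[a-z0-9']+", sent.lower())
        if len(tokens) < max_words:
            score = sum(weights.get(t, 0.0) for t in tokens)
            if score > 0:
                scores[sent] = score
    # Highest-scoring sentences first, as in the function above
    best = heapq.nlargest(n_sentences, scores, key=scores.get)
    return " ".join(best)

sample = "Gas prices rose sharply. The gas market is tight. Lunch at noon."
print(summarize_by_frequency(sample, 2))
```

`heapq.nlargest` already returns fewer items when fewer sentences were scored, so no separate length check is needed. Sentences come back in score order, not in their original order in the email.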

In [12]:
num_of_summary_sentences=3
my_nltk_df = no_for_or_replies_df.select('mid','body').rdd.map(lambda x:(x[0],nltk_text_summarization(clean(str(x[1]).replace("."," ")),num_of_summary_sentences))) \
           .filter(lambda x: x[1]).filter(lambda x: x[1] is not None)
#big_test.collect()
#   .filter(lambda x: x).filter(lambda x: x is not None)
my_nltk_df1 = my_nltk_df.toDF(['mid','summary'])
my_nltk_df1.createOrReplaceTempView('my_nltk_df1')  

In [13]:
num_of_summary_sentences=3
my_nltk_df = no_for_or_replies_df.select('mid','body').rdd.map(lambda x:(x[0],nltk_text_summarization(clean(str(x[1])),num_of_summary_sentences))) \
           .filter(lambda x: x[1]).filter(lambda x: x[1] is not None)
#big_test.collect()
#   .filter(lambda x: x).filter(lambda x: x is not None)
my_nltk_df1 = my_nltk_df.toDF(['mid','summary'])
my_nltk_df1.createOrReplaceTempView('my_nltk_df1')    

In [14]:
%sql select * from my_nltk_df1

mid,summary
307235,"Box 1188Houston, Texas 77251-1188Dear Dr. KaminskiMarch 12, 2001I am writing to formalize your invitation to attend, participate, and speakin the SIAM Southwest Regional Mathematics in Industry Workshop. Additionally theevent will focus upon the mechanisms facilitating interaction andcollaboration between the academy, industry, and government laboratories.The workshop will be held at the University of Houston Hilton Hotel, April27-28. Instead themeeting will emphasize the mathematics and technology currently applied tothe projects of industry and governmental laboratories."
333234,"I also made a change to May. I was calculating the differential off of the wrong location index.DG Drew,Here is Agave for June."
384232,are you coming to the uh/tx game on 9/23 - is donnita?
126632,Thanks.
108233,You can check the progress of your request by clicking http://itcapps.corp.enron.com/srrs/auth/emailLink.asp?ID=000000000020857&Page=MyReq. You will be notified by email when your request has been processed. Thank you for your request.
360634,"Rita, In order to book the imbalance payback between HPL and Lobo, I have created deal 1166748 (ENA selling to AEP @ $2.45). Following that, we sold the gas to the central desk, who sold it to Oneok at the South Texas pool. With the changes I made and the entries you are able to do, hopefully this deal will be wrapped up."
213032,"?$684.65 [IMAGE][IMAGE]HipZip MP3 Player Kit USB w/2 40MB Disk, Head Phones By Iomega Get a $50 Rebate direct from Iomega from February 17, 2001 through June 30, 2001. ?$135.27 [IMAGE][IMAGE]Stylus Color 777 Ink Jet 2880x720 12ppm USB By Epson The EPSON Stylus Color 777 ink jet printer delivers remarkable print speed, superb output and outstanding value. ?$505.64 [IMAGE][IMAGE]ML591 24-Pin 120V 360CPS Wide Parallel By Okidata Microline 590 and 591 OKISMART Paper Handling for hassle-free, reliable switching between different media."
109631,"> Tracy Ross, Counsel, Royal Bank of CanadaPhone: 416-974-5503; Fax: 416-974-2217File - This email may be privileged and confidential.? Any dissemination or use of this information by a person other than the intended recipient(s) is unauthorized.? Attached is the language that is necessary to make sure the CSA is one way.?"
247831,Send to SSchneider@aep.com with a copy to MLCarriere@aep.com. Please forward your altered draft back to me when completed.
272832,"----------------------------------------------------------------------------------This message and any attachments (the \""message\"") are intended solely for the addressees and are confidential. X-FileName: dutch quigley 6-26-02.PST(See attached file: Delta11-28.xls) Ce message et toutes les pieces jointes (ci-apres le \""message\"") sont etablis a l intention exclusive de ses destinataires et sont confidentiels. BNP PARIBAS (and its subsidiaries) shall (will) not therefore be liable for the message if modified."


In [15]:
# Below is a test of the generic function created to filter a dataframe.
# The function filter_my_dataframe can be used to compound your where clause via a dictionary you create for the columns in the where clause
# The function dynamically builds the where clause from the parameters passed in the dictionary. The second parameter is the name of the dataframe
# The test below grabs an email via its message id, mid, but can also be used to grab a certain date as well as a specific sender
#
#test_dict={'mid':404632}
test_dict={'mid':68633}
#test_dict={'sender':'ken.skilling@enron.com'}
test_df='df'
test_a_row=filter_my_dataframe(test_dict,test_df)
#type(test_a_row)
t=test_a_row.select('body').collect()
#t=test_a_row.select 'body'.rdd
#mi=nltk_text_summarization(clean(str(t)))
#my_weight_frequency(mi)
#nltk_text_summarization(clean(str(t)))
sum_sentences=nltk_text_summarization(clean(str(t)),3)
print(sum_sentences)


In [16]:
%sql select body from no_for_or_replies_df where mid=68633

body
"Following our announcement of an additional of $1 billion credit line, Standard & Poor s (S&P) today downgraded Enron s long-term credit rating one notch from BBB+ to BBB and short-term rating from A2 to A3. We expected this, because it is not unusual to be downgraded after using assets to secure credit. This is still above investment grade.The ratings of our pipelines Northern Natural Gas and Transwestern have also been lowered from A- to BBB. In S&P s words, \""Their ratings [are now] in line with those of the parent company to reflect S&P s view that Enron s pipeline assets have become more strategic to the company.\""S&P also said, \""[We continue] to believe that Enron s liquidity position is adequate to see the company through the current period of uncertainty, and that the company is working to provide itself with an even greater liquidity cushion through additional bank lines and pending asset sales.\""As I ve said before, building on our liquidity position through additional credit lines maintains our counterparties confidence and strengthens our core businesses.It s important for you to know that our gas and power numbers - which account for more than 95 percent of our trading activity - indicate that our customer base is not withdrawing, closing out positions, or reducing transaction levels as a result of credit concerns. In fact, EnronOnline trading volumes are currently experiencing above-normal activity.We will continue to update you as new developments arise. Thank you."


In [17]:
# Note: gensim.summarization was removed in gensim 4.0, so this cell needs gensim < 4.0
from gensim.summarization.summarizer import summarize
import re
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize


def short(sent):
  try:
    return summarize(sent)
  except Exception:
    # summarize() failed on this text; return None so the row can be filtered out
    return None


In [18]:
body_clean = no_for_or_replies_df.select('mid','body') \
   .rdd.map(lambda x:(x[0],short(re.sub(r'[^a-zA-Z\s\.]', ' ',clean(str(x[1])))))) \
   .filter(lambda x: x[1]).filter(lambda x: x[1] is not None)
body_clean1 = body_clean.toDF(['mid','summary'])
body_clean1.createOrReplaceTempView('body_clean1')  

In [19]:
%sql

SELECT * FROM body_clean1 where summary <> "";

mid,summary
307235,KaminskiEnronP.O. Box Houston Texas Dear Dr. KaminskiMarch I am writing to formalize your invitation to attend participate and speakin the SIAM Southwest Regional Mathematics in Industry Workshop. The workshop funded under the auspicesof a National Science Foundation grant to SIAM will not be a standardapplied mathematics event with representatives from industry academe andgovernmental agencies presenting their latest research results.
360634,. From what I understand we received the gas at King Ranch sold it to AEP so that they would move it through the header on their contract and bought it back at a half cent difference.
213032,IMAGE June Can t read this email Click hereIssue e PROVANTAGE Customer jeff dasovich enron.comTo unsubscribe from the Original Advantage Click here Do Not Reply to this email Products that give you the Professional Advantage IMAGE Crystal Reports v . IMAGE IMAGE PhotoSmart xi Digital Camera . IMAGE IMAGE MB Smart Media Card . IMAGE IMAGE PYRO CardBus DV for Notebooks By ADS Technologies Add IEEE ports to your Notebook Capture and Edit Digital Video Transfer data at Mb Sec. Connect to Hard Drives Printers Scanners ...More . GB External FireWire Dr By QPS Inc. The Que DVD RAM Fire Drive is ideal for desktop publishing archiving and presentations. IMAGE Web Address www.PROVANTAGE.com Toll Free Fax email sales provantage.com Privacy Policy Terms Conditions FREE Catalog PROVANTAGE Corporation Whipple Ave. NW North Canton OH Products prices terms conditions or offers may change at any time. The Original Advantage promotional email is delivered only to customers of PROVANTAGE Corporation. PROVANTAGE customers have purchased products in the past and submitted their email address as part of the checkout process.
272832,X FileName dutch quigley .PST See attached file Delta .xls Ce message et toutes les pieces jointes ci apres le message sont etablis a l intention exclusive de ses destinataires et sont confidentiels.
72631,We then sell that displaced power on the wholesale market where the prices are high and share the revenue with the customer.
39035,Topock is a more liquid market because of the upstream transport contracts on EPNG and downstream buying patterns of certain Socal markets. Picture the market San Juan gas is trading well below the Permian and Waha basins which happen to be the marginal supply points for the west. Because of the of the limited capacity going into Socal Topock every transporter on EPNG and TW wants to fill their contracts with San Juan gas first. This transport leg simply displaces Permian gas going into Plains North and allows Permian gas to head west to Ehrenberg on the EPNG south mainline. When demand exceeds all available historical supply points and begins to reach to Waha to meet Ehrenberg demand you have a paradigm shift. When Keystone West reaches maximum capacity the rule of market spreads not exceeding the max rate transport rate then dies a fast death. There is simply no available capacity at the Keystone West meter to allow a shipper to buy max rate transport on EPNG and ship it to Socal. The supply exists but there isn t enough refining capacity available to meet demand.
127431,The price does not include line losses which the Company agreed to throw in if another marketer signs on none did The market price is also the shopping credit. The shopping credit for residential and commercial customers is low but could possibly work pending analysis from structuring. . Customers on special contracts approximately of them can during the first year either opt to cancel their contract or extend it through the RTC recovery period which is unlawful in my view although that observation did not slow anyone down. . The settlement creates an implicit cap on shopping of since the Company will seek to have the incentive removed once the is reached. billion without quantification or review are Consumers Counsel Staff Industrial Users Retail Merchants Low income some Ohio Manufacturers Shell Energy Kroger AK Steel Ohio Hospital Association.Those fighting the stipulation are all the other marketers the Cleveland Growth Association the City of Cleveland Citizens Action Citizens Power Safe Energy Communications Council.By separate e mail Laurie Knight will send you the shopping credit and shopping incentive numbers.
348232,I have released the Boston Gas capacity on Tennessee contract and Iroquois contract to Boston Gas non recallable subject to bid per Boston Gas s request for Feb st for one month only.Other transport notes To serve the Lilco deal we need supply on Texas Eastern Tennessee and Transco.Texas Eastern is trying to decide how to handle the cash out exposure.
207431,Then ifyou are into it at all I will look into flight opportunities staying downthere and the scuba course and single dives are all very cheap but theflight would be the most expensive .
110631,Tan and Susan can one of you handle this I called Diane Anderson and told her that I ll be leaving Enron and won t be able to help her. In order to fill in the blanks in the first paragraphof Carol s template it is necessary for me to know the appropriate section in each counterparty s Agreement that references the existing confirmation procedures the Confirmation Procedures .Would it be possible for you to provide me with copies of the Master Agreementsfor each of the following counterparties so I can fill in the blanks and get these lettersout the door Bank One formerly FNB Bankers Trust CompanyBarclays Bank PLCThe Chase Manhattan BankCitibank N.A.Credit Suisse Financial ProductsElf Trading S.AJ.Aron CompanyParibasPhibro Inc.Royal Bank of CanadaI would be happy to come up and retrieve them as soon as you have them ready.If you have any questions please give me or Joe a call.Thanks we re looking forward to being able to implement this procedure.


In [20]:
df_lda = spark.sql("select a.*, b.summary from no_for_or_replies_df a, body_clean1 b \
where a.mid=b.mid and b.summary<>\'\'").rdd
df_lda.take(5)

In [21]:
df_lda = spark.sql("SELECT * FROM body_clean1 where summary <> \'\'").rdd.map(lambda x: (x[1],x[0]))
print(type(df_lda))
df_lda.collect()

In [22]:
%sql
select count(*) from body_clean1 where summary is null


count(1)
0


In [23]:
%sql select a.*, b.summary as gensim_summary, c.summary as nltk_summary from no_for_or_replies_df a, body_clean1 b, my_nltk_df1 c where a.mid=b.mid and a.mid=c.mid

In [24]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

In [25]:
def lemmatize_stemming(text):
  stemmer = PorterStemmer()
  return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def lemstem_preprocess(text):
  result = []
  for token in gensim.utils.simple_preprocess(text):
    if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
      result.append(lemmatize_stemming(token))
  return result

In [26]:
df.count()
test_dict={'mid':68633}
test_df='df'
test_a_row=filter_my_dataframe(test_dict,test_df)
#t=test_a_row.select('body','mid').collect()
#t=test_a_row.select('body')
#t=test_a_row.select('body','mid')
#type(t)
#print(t)
#t=no_for_or_replies_df.select ('body','mid').collect() 
t=spark.sql ("select body from no_for_or_replies_df")
#type(t.collect())
#type(t.take(1))
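# Use map (not flatMap) so each email stays a separate list of tokens, which is what Dictionary expects
processed_email =t.rdd.map(lambda x:lemstem_preprocess(clean(str(x[0])))).collect()
#processed_email =t.rdd.flatMap(lambda x:(lemstem_preprocess(clean(str(x[1]))),x[0])).collect()
dictionary = gensim.corpora.Dictionary(processed_email)
# collect() returns a Python list, so slice it instead of calling take()
processed_email[:5]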
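`gensim.corpora.Dictionary` takes an iterable of documents, each of which is a list of tokens, so the shape of the collected result matters here. The block below is a small sketch of the difference between a per-document shape (what `rdd.map` yields) and a flattened shape (what `rdd.flatMap` yields), using plain Python lists. `tokenize` is a hypothetical stand-in for `lemstem_preprocess`, and the id mapping only illustrates the idea; gensim's own id assignment may differ.

```python
from itertools import chain

def tokenize(doc):
    # Hypothetical stand-in for lemstem_preprocess: lower-case and split on whitespace
    return doc.lower().split()

docs = ["Gas prices rose", "Power prices fell"]

# One list of tokens per document: the shape rdd.map(...).collect() produces
per_doc = [tokenize(d) for d in docs]

# A single flat list of tokens: the shape rdd.flatMap(...).collect() produces
flattened = list(chain.from_iterable(per_doc))

# Token-to-id mapping built from the per-document lists, in first-seen order
token2id = {}
for tokens in per_doc:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))

print(per_doc)
print(flattened)
print(token2id)
```

With the flattened shape the document boundaries are lost, so per-document word counts can no longer be computed.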





In [27]:
#test_dict={'mid':68633}
#test_df='df'
#test_a_row=filter_my_dataframe(test_dict,test_df)
#t=test_a_row.select('body','mid').collect()
#t=test_a_row.select('body','mid')
#type(t)
#print(t)
# A Row is a tuple, so list(x) converts it; Row has no asList() method
t = df.select('body_cleaned').rdd.map(lambda x:list(x)).take(5)
print(t)



In [28]:
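# The column in df is named body_cleaned
t=spark.sql ("select body_cleaned from df").collect()
stemmer = PorterStemmer()
# Split the first cleaned email on spaces to inspect the raw words
words = []
for word in clean(str(t[0])).split(' '):
    words.append(word)
print(words)
# Print each token next to its lemmatized and stemmed form
for token in gensim.utils.simple_preprocess(clean(str(t[0]))):
  print(token)
  print(lemmatize_stemming(token))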
  
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector, Vectors

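# Loads data. df_lda is already an RDD of (summary, mid) pairs, so no .rdd is needed,
# and Python 3 lambdas cannot unpack a tuple in their argument list
data = df_lda.map(lambda x: Row(idd = x[1], words = x[0].split(" ")))
#data.count()
docDF = spark.createDataFrame(data)
# Named vectorizer so the imported Vector class is not shadowed
vectorizer = CountVectorizer(inputCol="words", outputCol="vectors")
model = vectorizer.fit(docDF)
result = model.transform(docDF)

# Each element is a Row, so index into it; mllib LDA expects [id, vector] pairs
corpus = result.select("idd", "vectors").rdd.map(lambda row: [row[0], Vectors.fromML(row[1])]).cache()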

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary

wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))

def topic_render(topic):  # specify vector id of words to actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print ("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print (term)
    print ('\n')
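The rendering step above maps term indices back to vocabulary words. The block below is a minimal sketch of that lookup on plain Python data, assuming topics arrive as `(term_indices, term_weights)` pairs as returned by `describeTopics`. `render_topics` is a hypothetical helper, and the vocabulary and topics are made-up illustration values, not results from this dataset.

```python
def render_topics(described_topics, vocabulary, words_per_topic):
    # Hypothetical helper: described_topics is a list of (term_indices, term_weights) pairs
    rendered = []
    for term_indices, _weights in described_topics:
        # Slicing avoids an IndexError when a topic has fewer terms than requested
        rendered.append([vocabulary[i] for i in term_indices[:words_per_topic]])
    return rendered

# Made-up illustration data
vocab = ["gas", "power", "price", "contract"]
topics = [([2, 0, 1], [0.5, 0.3, 0.2]), ([3, 1], [0.6, 0.4])]

for number, terms in enumerate(render_topics(topics, vocab, 3)):
    print("Topic" + str(number) + ":", ", ".join(terms))
```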


In [29]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)