# Video Summary

#### Adapted from: https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/video_summary.ipynb

In [2]:
# !pip install langchain langchain-community langchain-openai youtube-transcript-api tiktoken pytube

In [6]:
from langchain_community.callbacks import get_openai_callback

In [1]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=5t1vTLU7s40", add_video_info=True
)

In [2]:
# load the youtube video caption into Documents
docs = loader.load()

In [3]:
# check the transcript length in characters, preview the first 300 characters, and count the Documents
len(docs[0].page_content), docs[0].page_content[:300], len(docs)

(142689,
 "- I see the danger of this\nconcentration of power through proprietary AI systems as a much bigger danger\nthan everything else. What works against this is people who think that\nfor reasons of security, we should keep AI systems\nunder lock and key because it's too dangerous to put it in the hands of e",
 1)

## Language model

In [4]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")

## Simple model

In [7]:
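# keep only the first 4000 characters (about 3% of the transcript) so the prompt stays small
text = docs[0].page_content[:4000]

with get_openai_callback() as cb:
    # Portuguese prompt: "Please write a summary in Portuguese of the text below"
    summary = llm.invoke(f"Por favor, faça um resumo em português do texto abaixo: {text}.")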
    print(cb)
print(summary)

Tokens Used: 1118
	Prompt Tokens: 890
	Completion Tokens: 228
Successful Requests: 1
Total Cost (USD): $0.0002703
content='O texto discute a concentração de poder em sistemas de inteligência artificial (IA) proprietários, que é vista como um grande perigo. O autor argumenta que manter esses sistemas sob controle restrito, por motivos de segurança, pode resultar em um futuro negativo, onde poucas empresas controlam toda a informação. Ele acredita que as pessoas são fundamentalmente boas e que a IA, especialmente a de código aberto, pode potencializar essa bondade. Yann LeCun, cientista-chefe de IA da Meta e um dos principais nomes na área, defende a abertura no desenvolvimento de IA e critica aqueles que alertam sobre os perigos da inteligência geral artificial (AGI). LeCun acredita que a AGI será criada um dia, mas que será benéfica e não escapará ao controle humano. Ele também discute as limitações dos modelos de linguagem autoregressivos, como o GPT-4 e LLaMA, afirmando que, embora s
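
The cost reported by the callback can be checked by hand from the token counts above. The sketch below assumes per-token prices of $0.15 per million input tokens and $0.60 per million output tokens for `gpt-4o-mini`; these prices are not stated in the notebook and may change, but under that assumption the arithmetic reproduces the reported total. The helper name `estimate_cost` is hypothetical.

```python
# Assumed prices for gpt-4o-mini in USD per 1M tokens (not stated in the notebook).
INPUT_USD_PER_MILLION = 0.15
OUTPUT_USD_PER_MILLION = 0.60


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated cost of one request in USD."""
    total_per_million = (
        prompt_tokens * INPUT_USD_PER_MILLION
        + completion_tokens * OUTPUT_USD_PER_MILLION
    )
    return total_per_million / 1_000_000


# Token counts taken from the callback output above.
cost = estimate_cost(prompt_tokens=890, completion_tokens=228)
print(f"${cost:.7f}")  # prints $0.0002703
```

Note that this covers only the first 4000 of 142689 characters, so the summary above reflects a small opening slice of the video.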

## Summarizing the full transcript (refine chain)

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# split the long transcript into chunks of at most ~1000 tokens each (measured with tiktoken), with no overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)
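
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of splitting by a token budget. It is a simplified stand-in, not LangChain's algorithm: it treats whitespace-separated words as "tokens" and cuts at fixed windows, whereas `RecursiveCharacterTextSplitter` measures length with tiktoken and prefers to cut at separators such as newlines. The function name `split_by_token_budget` is hypothetical.

```python
def split_by_token_budget(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# A synthetic 2500-word "transcript" splits into chunks of 1000, 1000 and 500 words.
sample = " ".join(f"w{i}" for i in range(2500))
chunks = split_by_token_budget(sample, chunk_size=1000, chunk_overlap=0)
print([len(c.split()) for c in chunks])  # prints [1000, 1000, 500]
```

As a rough estimate only: the earlier request used 890 prompt tokens for about 4000 characters, so the 142689-character transcript is on the order of 30,000 tokens, which would give roughly 30 chunks at this chunk size.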

In [20]:
from langchain.chains.summarize import load_summarize_chain

# "refine" summarizes the first chunk, then updates that summary with each following chunk in turn
chain = load_summarize_chain(llm, chain_type="refine", verbose=True)

with get_openai_callback() as cb:
    summary = chain.invoke(split_docs)
    print(cb)
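
The refine strategy used above can be sketched in a few lines of plain Python. This is an illustration of the idea under stated assumptions, not LangChain's implementation: the prompt wording is a paraphrase, `refine_summarize` is a hypothetical helper, and `fake_llm` is a stand-in so the loop can run without an API key.

```python
from typing import Callable, Sequence


def refine_summarize(chunks: Sequence[str], llm_call: Callable[[str], str]) -> str:
    """Summarize the first chunk, then refine that summary once per remaining chunk."""
    if not chunks:
        return ""

    # Initial call: summarize the first chunk on its own.
    summary = llm_call(f"Write a concise summary of the following:\n{chunks[0]}")

    # Refine calls: each one sees the running summary plus one new chunk.
    for chunk in chunks[1:]:
        prompt = (
            f"Existing summary:\n{summary}\n\n"
            f"Refine the summary using the additional context below. "
            f"If the context is not useful, return the existing summary.\n{chunk}"
        )
        summary = llm_call(prompt)
    return summary


# Stand-in for the model: records each prompt and returns a numbered placeholder.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary v{len(calls)}"


result = refine_summarize(["chunk a", "chunk b", "chunk c"], fake_llm)
print(result, len(calls))  # prints summary v3 3
```

Two consequences of this design follow: the calls are sequential, one per chunk, and every refine call resends the running summary, so total prompt tokens grow as the summary gets longer. It also means the final output can include the model's remarks about whether the last chunk changed the summary.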

In [18]:
from IPython.display import Markdown
Markdown(summary['output_text'])

The original summary is comprehensive and captures the key points made by Yann LeCun regarding the future of AI, the limitations of current models, and the importance of open-source approaches. The additional context provided does not significantly alter or enhance the existing summary, as it reiterates themes of human goodness, the transformative potential of AI, and the need for open-source solutions. Therefore, the original summary remains relevant and effective in conveying LeCun's insights.

**Final Summary:**

Yann LeCun, chief AI scientist at Meta, emphasizes the dangers of concentrating power in proprietary AI systems, advocating for open-source AI to empower individuals. He believes in the fundamental goodness of people and argues that open-source AI can enhance this goodness. LeCun critiques the reliance on autoregressive large language models (LLMs) like GPT-4 for achieving superhuman intelligence, stating they lack essential characteristics of intelligent behavior, such as understanding the physical world, persistent memory, reasoning, and planning. He argues that most of our knowledge is derived from direct interaction with the physical world rather than language, highlighting that sensory input provides a richer and more immediate learning experience. While acknowledging the usefulness of LLMs, he contends they are insufficient for developing human-level intelligence compared to the vast experiential learning of a child.

LeCun also notes that intelligence must be grounded in reality, whether physical or simulated, and that many tasks we take for granted, such as driving or performing household chores, require a level of embodied understanding that LLMs currently lack. He explains that the training process of LLMs, which involves predicting the next word in a sequence based on prior words, limits their ability to engage in deeper reasoning or intuitive understanding of the physical world. This autoregressive approach contrasts with human cognition, where abstract thinking occurs independently of language, allowing for a more nuanced understanding of concepts before they are expressed verbally.

Furthermore, LeCun argues that true understanding of the world requires a sophisticated internal world model, which LLMs do not possess. He asserts that while it is possible to build a world model through prediction, relying solely on language is inadequate due to its low bandwidth. Instead, he emphasizes the importance of observing the world and understanding its dynamics to create a model that can predict future states based on potential actions. He highlights the challenges in developing generative models for video, noting that predicting the distribution of frames in a video is significantly more complex than predicting the next word in a sequence. LeCun points out that the world is far more complicated and rich in information than text, as video represents high-dimensional continuous spaces, making it difficult for models to predict intricate details, such as textures and features in a scene.

He critiques existing methods, such as training systems to reconstruct images from corrupted versions, as largely ineffective for developing robust representations of visual information. Instead, he advocates for alternative approaches, like joint embedding predictive architecture (JEPA), which involves taking both the full and corrupted versions of an image, running them through encoders, and training a predictor to derive the representation of the full input from the corrupted one. LeCun explains that JEPA focuses on extracting abstract representations rather than predicting all pixel details, making it a more efficient method for capturing essential information. He acknowledges that while JEPA is a step towards advanced machine intelligence, it is not a complete solution, as it still requires further development to achieve the level of understanding and reasoning found in human cognition.

LeCun further elaborates that effective learning requires abstraction, where unnecessary details are filtered out to focus on essential features of the environment. He argues that while language provides a level of abstraction, it can also lead to a reliance on simplified representations that may not capture the complexity of the physical world. He warns against prematurely combining visual and language models, as this could lead to superficial improvements without achieving the deeper understanding exhibited by non-linguistic animals. Ultimately, he believes that developing a robust understanding of the world through sensory input is crucial before integrating language into AI systems.

In his exploration of advanced learning techniques, LeCun discusses non-contrastive methods such as BYOL, VICReg, and I-JEPA, which focus on training systems to predict representations of uncorrupted inputs from corrupted ones. He describes how these methods involve corrupting images through transformations like cropping or blurring and then training a predictor to derive the original representation. He also introduces V-JEPA, an extension of I-JEPA applied to video, which masks segments of frames across multiple time steps to learn effective representations of video data. These approaches aim to enhance the system's ability to learn common sense and improve its understanding of the world, ultimately contributing to the development of more sophisticated AI systems.

LeCun further explains that these advanced models can help determine the physical plausibility of video sequences, allowing the system to recognize when objects appear, disappear, or change shape in ways that defy physical laws. He suggests that with modifications, such as predicting future states based on actions taken (like steering a car), these models could eventually support complex tasks like driving. He emphasizes the importance of creating an internal model that can predict outcomes based on actions, which is essential for planning and decision-making. This model predictive control approach allows for the planning of sequences of actions to achieve specific objectives, a capability that LLMs currently lack. LeCun acknowledges that while hierarchical planning is necessary for complex tasks, it requires specific architectural designs to emerge from these foundational models.

In discussing the limitations of LLMs, LeCun illustrates that while they can generate plans and answer questions at a certain level of abstraction, they struggle with detailed, physical actions, such as the precise steps needed to stand up from a chair or navigate complex environments. He emphasizes that LLMs depend on their training data and may produce hallucinated or non-factual responses when faced with unfamiliar scenarios. He explains that the autoregressive nature of LLMs leads to an exponential increase in the likelihood of nonsensical outputs as the model generates more tokens, due to the accumulation of errors. This highlights a fundamental flaw in LLMs, as they can only perform well on prompts they have been trained on, while the vast majority of possible prompts remain unaddressed.

LeCun acknowledges the impressive capabilities of LLMs, particularly in self-supervised learning, which has demonstrated significant advancements in language understanding and translation. However, he maintains that these models still lack the deep understanding of the physical world necessary for true intelligence. He critiques the Turing test as an inadequate measure of intelligence, emphasizing the need to recognize the limitations of current AI systems while acknowledging their impressive capabilities.

LeCun concludes that to achieve human-level AI, it is essential to move beyond generative models and focus on joint embedding and predictive approaches that can better capture the complexities of the real world. He argues that high-level reasoning in LLMs is fundamentally different from the common-sense reasoning required to navigate the physical world, and that a robust understanding of low-level experiences is necessary to build a consistent world model. This foundational knowledge, which LLMs currently lack, is crucial for developing AI systems that can effectively understand and interact with the world. He also notes that the autoregressive nature of LLMs leads to a constant computational effort per token produced, which limits their ability to adaptively allocate resources for more complex queries, further underscoring their limitations in reasoning and planning compared to human cognition.

LeCun further elaborates on the future of AI systems, suggesting that they will differ significantly from autoregressive LLMs. He draws a parallel between human cognitive processes, distinguishing between "system one" tasks, which are instinctive and require little conscious thought, and "system two" tasks, which involve deliberate planning and reasoning. He posits that future AI systems will need to incorporate a planning mechanism that allows them to allocate resources based on the complexity of the task at hand, moving away from the autoregressive prediction of tokens. Instead, he envisions a model that optimizes answers in an abstract representation space, using energy-based models to evaluate the quality of responses. This approach would enable AI systems to engage in more sophisticated reasoning and planning, ultimately leading to a more human-like understanding of the world. LeCun recommends abandoning generative models, autoregressive generation, and probabilistic models in favor of joint embedding architectures and energy-based models. He also suggests minimizing the use of reinforcement learning (RL), advocating for model predictive control as a more efficient alternative, using RL only when necessary to adjust world models or objectives.

In light of recent criticisms of AI systems, such as Google's Gemini 1.5, LeCun addresses the issue of bias in AI. He argues that it is impossible to create an unbiased AI system, as bias is subjective and varies among individuals. He advocates for open-source AI as a solution, emphasizing the need for diverse perspectives in AI development, akin to the principles of free speech and a free press in a democracy. LeCun envisions a future where AI mediates our interactions with the digital world, underscoring the importance of transparency and diversity in AI systems to foster a more equitable and informed society. He stresses that the concentration of AI power in a few companies poses a danger to democracy and local cultures, advocating for open-source platforms that allow diverse groups to fine-tune AI systems for their specific needs, thereby promoting a rich ecosystem of AI applications that reflect varied languages, cultures, and values. LeCun also discusses Meta's business model, suggesting that providing open-source models can benefit the company by allowing businesses to build applications on top of them, ultimately enhancing Meta's offerings and revenue potential without compromising the distribution of foundational models. He further notes that open-source AI enables diversity, allowing for tailored models that can cater to different political and cultural perspectives, which could lead to a more nuanced and effective interaction with technology.

LeCun expresses excitement about the potential for human-level intelligence through advancements in AI, particularly in the context of collaborative efforts with researchers at institutions like DeepMind and Berkeley. He acknowledges the importance of both computational scale and architectural innovation in achieving these goals, while also recognizing the need for hardware improvements to match the efficiency of the human brain. He emphasizes that the journey toward artificial general intelligence (AGI) will be gradual, rather than a sudden breakthrough, and that significant progress is still required in learning representations and understanding the world before reaching human-level capabilities. He cautions that achieving a fully integrated system capable of reasoning, planning, and learning in a hierarchical manner will take at least a decade, if not longer, due to the numerous challenges that remain unaddressed in the field.

LeCun also addresses concerns about the potential dangers of advanced AI systems, arguing against the notion that intelligent systems will inherently seek to dominate or harm humans. He explains that AI systems are not species and do not possess the hardwired desires for dominance found in social species. Instead, he believes that AI can be designed to be submissive to human control, with guardrails in place to ensure safe operation. He acknowledges the complexity of designing these guardrails and emphasizes the need for iterative development to refine AI behavior over time. Drawing an analogy to the evolution of turbojet safety, he suggests that just as engineers have progressively improved turbojet reliability, AI systems can be developed to be controllable and safe through careful design and ongoing adjustments. He also discusses the potential for AI systems to mediate interactions in the digital world, acting as filters to protect users from manipulative or harmful content, thereby enhancing the safety and reliability of AI technologies.

LeCun reflects on the human psychology surrounding new technologies, likening the skepticism towards AI to historical fears of innovations like electricity or trains. He notes that such fears often stem from a natural instinct to protect cultural norms and societal structures from perceived threats. He emphasizes the importance of embracing change and understanding the real versus imagined dangers of new technologies. LeCun reiterates the necessity of open-source platforms to democratize AI development and prevent the concentration of power in a few large companies, which could exploit the technology to the detriment of society. He advocates for a diverse set of voices in AI development to ensure that the technology reflects a wide range of cultural values and perspectives, ultimately fostering a more equitable and informed society. He warns that the concentration of AI power in proprietary systems poses a significant threat to democracy and the diversity of ideas, advocating for open-source solutions to preserve a rich ecosystem of thought and innovation.

LeCun also highlights the complexity of tasks that AI systems need to perform in real-world environments, such as cleaning, cooking, and navigating spaces filled with uncertainty. While current AI systems can perform specific tasks, like navigating to a fridge or picking up objects, they are not yet capable of generalizing these skills to more complex tasks, such as clearing a dinner table. He expresses hope for the future of humanoid robots and their potential to enhance human interaction with AI in physical spaces, allowing for deeper philosophical and psychological exploration of our relationships with robots. He encourages innovative research in areas like self-supervised learning from video and planning with learned world models, emphasizing that significant advancements can be made without necessarily relying on large datasets. LeCun believes that the future of AI will involve hierarchical planning and the ability to plan actions in various contexts, not just physical ones, which remains an area with much room for exploration and development.

LeCun envisions a future where AI amplifies human intelligence, likening it to the transformative impact of the printing press, which democratized knowledge and fostered enlightenment. He believes that AI can enhance human capabilities, allowing individuals to manage a "staff" of intelligent assistants that can perform tasks more efficiently. This potential for AI to improve human decision-making and knowledge dissemination gives him hope for the future, despite the challenges posed by societal divisions and conflicts. He argues that, like the printing press, AI has the potential to elevate humanity, provided it is developed and integrated thoughtfully into society.